## Running: ../01_annotated/01a_save_data_rds_lilly_bc.R
## New names:
## • `` -> `...11`
## • `` -> `...12`
## • `` -> `...13`
## • `` -> `...14`
## Warning: There was 1 warning in `summarise()`.
## ℹ In argument: `lambda = { ... }`.
## ℹ In group 1: `compound_name = "1,11-Undecanedicarboxylate"`.
## Caused by warning:
## ! `cur_data()` was deprecated in dplyr 1.1.0.
## ℹ Please use `pick()` instead.
## # A tibble: 5 × 3
## step ZHP_metabolites RPNPF_metabolites
## <chr> <int> <int>
## 1 Initial 315 298
## 2 After ≤65% Missingness 258 213
## 3 After Pooled Noise-to-Signal ≤ 0.3 222 162
## 4 After CV(JMML) ≤ 0.3 & CV(Control) ≤ 0.3 178 155
## 5 Deduplicated shared metabolites by lowest m… 169 141
## Running: ../01_annotated/01b_missingness.R
## Saving 7 x 5 in imageSaving 7 x 5 in imageRunning: ../01_annotated/01c_CV_viz.R
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Running: ../01_annotated/02_fishers.R
## » Running Fisher tests for ZHP
## ✓ Results written to: ../res/annotated/ZHP_fisher.xlsx
## » Running Fisher tests for RPNPF
## ✓ Results written to: ../res/annotated/RPNPF_fisher.xlsx
## Running: ../01_annotated/03_QLIC_imputation.R
## Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size):
## semi-transparency is not supported on this device: reported only once per page
## Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size):
## semi-transparency is not supported on this device: reported only once per page
## Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size):
## semi-transparency is not supported on this device: reported only once per page
## Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size):
## semi-transparency is not supported on this device: reported only once per page
## # A tibble: 5 × 16
## compound_name effect raw_p mean_Control mean_JMML log2_fc platform
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 2-hydroxychrysene -10.3 6.12e-18 16.0 19.8 3.86 ZHP
## 2 2-Oxobutanoic acid 13.9 6.87e-10 NA NA NA ZHP
## 3 3-(4-Hydroxyphenyl)Py… 16.8 1.33e- 9 NA NA NA ZHP
## 4 Bisphenol AP 10.1 2.99e- 8 NA NA NA ZHP
## 5 Caffeic acid 16.8 1.33e- 9 NA NA NA ZHP
## # ℹ 9 more variables: test_type <chr>, a_Control_NA <int>,
## # b_Control_Present <int>, c_JMML_NA <int>, d_JMML_Present <int>,
## # ci_lower <dbl>, ci_upper <dbl>, log2_OR <dbl>, global_fdr <dbl>
## Running: ../01_annotated/04a_results_viz.R
## Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size):
## semi-transparency is not supported on this device: reported only once per page
## Saving 7 x 5 in image
## Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size):
## semi-transparency is not supported on this device: reported only once per page
## Saving 7 x 5 in image
## Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size):
## semi-transparency is not supported on this device: reported only once per page
## Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size):
## semi-transparency is not supported on this device: reported only once per page
## Saving 7 x 5 in image
## Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size):
## semi-transparency is not supported on this device: reported only once per page
## Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size):
## semi-transparency is not supported on this device: reported only once per page
## Saving 7 x 5 in image
## Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size):
## semi-transparency is not supported on this device: reported only once per page
## Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size):
## semi-transparency is not supported on this device: reported only once per page
## Saving 7 x 5 in image
## Warning: Removed 253 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Warning: Removed 253 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Saving 7 x 5 in image
## Running: ../01_annotated/04b_pathway.R
##
## Running: ../01_annotated/05a_Sex_analysis.R
##
## Running: ../01_annotated/05b_Sex_analysis_viz.R
##
## Saving 10 x 5 in image
## Saving 10 x 5 in image
## Saving 10 x 5 in image
## Saving 10 x 5 in image
## Saving 10 x 5 in image
## Saving 10 x 5 in image
## Running: ../01_annotated/06a_Somatic_analysis.R
##
## New names:
## Running: ../01_annotated/06b_Somatic_analysis_viz.R
##
## Saving 10 x 5 in image
## Saving 10 x 5 in image
## Running: ../01_annotated/07a_venn_diagram.R
##
## Coordinate system already present. Adding new coordinate system, which will
## replace the existing one.
## Warning in grid.Call.graphics(C_polygon, x$x, x$y,
## list(as.integer(seq_along(x$x)))): semi-transparency is not supported on this
## device: reported only once per page
## Coordinate system already present. Adding new coordinate system, which will
## replace the existing one.
## Warning in grid.Call.graphics(C_polygon, x$x, x$y,
## list(as.integer(seq_along(x$x)))): semi-transparency is not supported on this
## device: reported only once per page
## Running: ../01_annotated/07b_consistency_scatterplot.R
## Finished main.R
Juvenile myelomonocytic leukemia (JMML) is a rare form of chronic leukemia (cancer of the blood) that affects children, commonly those aged four and younger. The name JMML now encompasses all diagnoses formerly referred to as juvenile chronic myeloid leukemia (JCML), chronic myelomonocytic leukemia of infancy, and infantile monosomy 7 syndrome. The average age of patients at diagnosis is two (2) years old. The World Health Organization has included JMML as a subcategory of myelodysplastic and myeloproliferative disorders. (From: https://en.wikipedia.org/wiki/Juvenile_myelomonocytic_leukemia)
Metabolomic profiles were measured from dry blood spots from newborn screening card of 58 JMML patients and 58 healthy controls. Two sets of mass-spectrometry data were generated for each of the 116 newborns, using different ways of separating the compounds:
an untargeted platform using reverse-phase negative with a PFAS-free kit (RPNPF) ionization modes
HILIC chromatography with positive ionization (ZHP).
Our primary goal in this analysis is biomarker discovery. We aim to use mass spectrometry data from dried blood spots to classify newborns as either JMML-positive or JMML-negative.
# ==============================================================================
# Script Name: save_data_rds.R
# Author: Steven Brooks
# Date: 2025-05-13
# Description: Load the files in, filter, and save to .rds files.
# ==============================================================================
# ==== Setup ===================================================================
library(janitor)
library(tidyverse)
library(readxl)
library(purrr)
if (!dir.exists("../res/annotated")) dir.create("../res/annotated", recursive = TRUE)
if (!dir.exists("../data/R_objects")) dir.create("../res/R_objects", recursive = TRUE)
# ==== Load data ===============================================================
# Annotated
raw_RPNPF <- readxl::read_xlsx("../data/Annotated_RPNPF_JMML_Meyer.xlsx")
raw_ZHP <- readxl::read_xlsx("../data/Annotated_ZHP_JMML_Meyer.xlsx")
raw_meta_info <- readxl::read_xlsx("../data/MetaFileJMML_annotated.xlsx")
#Clean column names
raw_RPNPF <- raw_RPNPF %>% janitor::clean_names()
raw_ZHP <- raw_ZHP %>% janitor::clean_names()
#For Ectoine: please use row 202
#For 5-Aminopentanoic acid: please use row 222
raw_ZHP <- raw_ZHP[-c(199, 238),]
# ==== ZHP Long format =========================================================
#Annotated
ZHP <- raw_ZHP %>% dplyr::select(-c(molecular_formula,mass,rt_min,rt_sec,adduct,mz,cas_id))
# Create sample group column
raw_meta_info <- raw_meta_info %>%
mutate(
sample_group = case_when(
SampleType == "sample" ~ case_when(
Sample_Type == "JMML" ~ "JMML",
Sample_Type == "Control" ~ "Control",
TRUE ~ "Blank"
),
TRUE ~ SampleType
)
)
# Pivot longer to tidy format
ZHP_long <- ZHP %>%
pivot_longer(
cols = -compound_name, # Keep compound_name as identifier
names_to = "SampleID", # Column to hold sample names
values_to = "intensity" # Column to hold the measurement values
)
# Ensure matching case for SampleID
raw_meta_info <- raw_meta_info %>%
mutate(SampleID_lower = tolower(SampleID))
ZHP_long <- ZHP_long %>%
mutate(SampleID_lower = tolower(SampleID)) %>%
left_join(
raw_meta_info %>% select(SampleID_lower, sample_group),
by = "SampleID_lower"
) %>%
select(-SampleID_lower) %>%
replace_na(list(sample_group = "blank"))
# ==== ZHP Filter 65% Missingness ==============================================
# Get compound names passing the missingness filter (only JMML & Control considered)
valid_compounds <- ZHP_long %>%
filter(sample_group %in% c("JMML", "Control")) %>%
group_by(compound_name) %>%
summarise(
prop_na_or_zero = mean(is.na(intensity) | intensity == 0),
.groups = "drop"
) %>%
filter(prop_na_or_zero <= 0.65) %>%
pull(compound_name)
# Step 4: Filter entire dataset to keep only those compounds
ZHP_NA_filtered <- ZHP_long %>%
filter(compound_name %in% valid_compounds)
# ==== ZHP Filter Noise-to-Signal ==============================================
ZHP_signal_filtered <- ZHP_NA_filtered %>%
group_by(compound_name) %>%
summarise(
median_matrix = median(intensity[sample_group == "matrix"], na.rm = TRUE),
median_pooledQC = median(intensity[sample_group == "PooledQC"], na.rm = TRUE),
.groups = "drop"
) %>%
mutate(
noise_to_signal = median_matrix / median_pooledQC
) %>%
filter(
# Keep if median_pooledQC is NOT NA
!is.na(median_pooledQC),
# Keep if noise_to_signal is ≤ 0.3 OR median_matrix is NA
noise_to_signal <= 0.3 | is.na(median_matrix)
) %>%
select(compound_name)
# Filter original data to keep only good compounds
ZHP_NtS_NA_filtered <- ZHP_NA_filtered %>%
semi_join(ZHP_signal_filtered, by = "compound_name")
# ==== Hemoglobin Correlation ==================================================
# Load hemoglobin
hemoglobin <- raw_meta_info %>%
select(SampleID, sample_group, `Hemoglobin (mg/mL)`) %>%
mutate(SampleID = tolower(SampleID)) %>%
drop_na()
# Merge hemoglobin with filtered metabolite data
ZHP_with_hgb <- ZHP_NtS_NA_filtered %>%
filter(sample_group %in% c("JMML", "Control")) %>%
left_join(
hemoglobin %>% rename(hemoglobin_mg_per_mL = `Hemoglobin (mg/mL)`),
by = c("SampleID", "sample_group")
) %>%
filter(!is.na(hemoglobin_mg_per_mL), !is.na(intensity))
# Spearman's cor on raw intensity
spearman_hgb_cor_ZHP <- ZHP_with_hgb %>%
group_by(compound_name) %>%
summarise(
n = n(),
rho = cor(intensity, hemoglobin_mg_per_mL, method = "spearman"),
.groups = "drop"
)
#Histogram of rho values
p_hist_rho_ZHP <- ggplot(spearman_hgb_cor_ZHP, aes(x = rho)) +
geom_histogram(binwidth = 0.05, fill = "steelblue", color = "black", alpha = 0.8) +
labs(title = "Histogram of Spearman's ρ Values\nZHP",
x = "Spearman's ρ",
y = "Count") +
theme_bw()
# Save the histogram
ggsave(
filename = "../res/annotated/histogram_rho_ZHP.png",
plot = p_hist_rho_ZHP,
width = 7,
height = 5,
units = "in",
dpi = 300
)
# ==== ZHP Power Transformation ================================================
lambdas_QC <- ZHP_NtS_NA_filtered %>%
filter(sample_group %in% c("JMML", "Control")) %>%
group_by(compound_name) %>%
summarise(
lambda = {
model <- lm(intensity ~ 1, data = cur_data())
bc <- MASS::boxcox(model, lambda = seq(-2, 2, 0.1), plotit = FALSE)
lambda_opt <- bc$x[which.max(bc$y)]
lambda_opt
},
.groups = "drop"
)
# 2. Join λ and apply transformation only when intensity > 0
ZHP_transformed <- ZHP_NtS_NA_filtered %>%
filter(sample_group %in% c("JMML", "Control")) %>%
left_join(lambdas_QC, by = "compound_name") %>%
mutate(
intensity_bc = case_when(
is.na(intensity) | intensity <= 0 | is.na(lambda) ~ NA_real_,
abs(lambda) < 1e-8 ~ log(intensity),
TRUE ~ (intensity^lambda - 1) / lambda
)
)
head(ZHP_transformed)
# ==== ZHP Filter Coefficient of Variation =====================================
cv_tbl_ZHP <- ZHP_transformed %>%
filter(sample_group %in% c("JMML", "Control")) %>%
group_by(compound_name, sample_group) %>%
summarise(
mean_intensity_bc = mean(intensity_bc, na.rm = TRUE),
sd_intensity_bc = sd(intensity_bc, na.rm = TRUE),
cv = sd_intensity_bc / mean_intensity_bc,
.groups = "drop"
)
# Pivot all three summary columns to wide format
cv_wide_ZHP <- cv_tbl_ZHP %>%
pivot_wider(
names_from = sample_group,
values_from = c(mean_intensity_bc, sd_intensity_bc, cv),
names_sep = "_"
) %>%
filter(
!is.na(cv_Control) & cv_Control < 0.3,
!is.na(cv_JMML) & cv_JMML < 0.3
)
# Filter based on conditions
cv_filtered_compounds_ZHP <- cv_wide_ZHP %>%
pull(compound_name)
ZHP_final_filtered <- ZHP_NtS_NA_filtered %>%
semi_join(tibble(compound_name = cv_filtered_compounds_ZHP), by = "compound_name")
# ==== ZHP Filtering Summary Table =============================================
ZHP_filter_summary <- tibble(
step = c(
"Initial",
"After ≤65% Missingness",
"After Pooled Noise-to-Signal ≤ 0.3",
"After CV(JMML) ≤ 0.3 & CV(Control) ≤ 0.3"
),
n_metabolites = c(
n_distinct(ZHP_long$compound_name),
n_distinct(ZHP_NA_filtered$compound_name),
n_distinct(ZHP_NtS_NA_filtered$compound_name),
n_distinct(ZHP_final_filtered$compound_name)
)
)
# ==== RPNPF Long format =========================================================
#Annotated
RPNPF <- raw_RPNPF %>% select(-c(formula,mass,rt_min,rt_sec,adduct,m_z,cas_id))
# Create sample group column
raw_meta_info <- raw_meta_info %>%
mutate(
sample_group = case_when(
SampleType == "sample" ~ case_when(
Sample_Type == "JMML" ~ "JMML",
Sample_Type == "Control" ~ "Control",
TRUE ~ "Blank"
),
TRUE ~ SampleType
)
)
# Pivot longer to tidy format
RPNPF_long <- RPNPF %>%
pivot_longer(
cols = -compound_name, # Keep compound_name as identifier
names_to = "SampleID", # Column to hold sample names
values_to = "intensity" # Column to hold the measurement values
)
# Ensure matching case for SampleID
raw_meta_info <- raw_meta_info %>%
mutate(SampleID_lower = tolower(SampleID))
RPNPF_long <- RPNPF_long %>%
mutate(SampleID_lower = tolower(SampleID)) %>%
left_join(
raw_meta_info %>% select(SampleID_lower, sample_group),
by = "SampleID_lower"
) %>%
select(-SampleID_lower) %>%
replace_na(list(sample_group = "blank"))
# ==== RPNPF Filter 65% Missingness ==============================================
# Get compound names passing the missingness filter (only JMML & Control considered)
valid_compounds <- RPNPF_long %>%
filter(sample_group %in% c("JMML", "Control")) %>%
group_by(compound_name) %>%
summarise(
prop_na_or_zero = mean(is.na(intensity) | intensity == 0),
.groups = "drop"
) %>%
filter(prop_na_or_zero <= 0.65) %>%
pull(compound_name)
# Step 4: Filter entire dataset to keep only those compounds
RPNPF_NA_filtered <- RPNPF_long %>%
filter(compound_name %in% valid_compounds)
# ==== RPNPF Filter Noise-to-Signal ==============================================
RPNPF_signal_filtered <- RPNPF_NA_filtered %>%
group_by(compound_name) %>%
summarise(
median_matrix = median(intensity[sample_group == "matrix"], na.rm = TRUE),
median_pooledQC = median(intensity[sample_group == "PooledQC"], na.rm = TRUE),
.groups = "drop"
) %>%
mutate(
noise_to_signal = median_matrix / median_pooledQC
) %>%
filter(
# Keep if median_pooledQC is NOT NA
!is.na(median_pooledQC),
# Keep if noise_to_signal is ≤ 0.3 OR median_matrix is NA
noise_to_signal <= 0.3 | is.na(median_matrix)
) %>%
select(compound_name)
# Filter original data to keep only good compounds
RPNPF_NtS_NA_filtered <- RPNPF_NA_filtered %>%
semi_join(RPNPF_signal_filtered, by = "compound_name")
# ==== Hemoglobin Correlation ==================================================
# Load hemoglobin
hemoglobin <- raw_meta_info %>%
select(SampleID, sample_group, `Hemoglobin (mg/mL)`) %>%
mutate(SampleID = tolower(SampleID)) %>%
drop_na()
# Merge hemoglobin with filtered metabolite data
RPNPF_with_hgb <- RPNPF_NtS_NA_filtered %>%
filter(sample_group %in% c("JMML", "Control")) %>%
left_join(
hemoglobin %>% rename(hemoglobin_mg_per_mL = `Hemoglobin (mg/mL)`),
by = c("SampleID", "sample_group")
) %>%
filter(!is.na(hemoglobin_mg_per_mL), !is.na(intensity))
spearman_hgb_cor_RPNPF <- RPNPF_with_hgb %>%
group_by(compound_name) %>%
summarise(
n = n(),
rho = cor(intensity, hemoglobin_mg_per_mL, method = "spearman"),
.groups = "drop"
)
p_hist_rho_RPNPF <- ggplot(spearman_hgb_cor_RPNPF, aes(x = rho)) +
geom_histogram(binwidth = 0.05, fill = "steelblue", color = "black", alpha = 0.8) +
labs(title = "Histogram of Spearman's ρ Values\nRPNPF",
x = "Spearman's ρ",
y = "Count") +
theme_bw()
# Save the histogram
ggsave(
filename = "../res/annotated/histogram_rho_RPNPF.png",
plot = p_hist_rho_RPNPF,
width = 7,
height = 5,
units = "in",
dpi = 300
)
# ==== RPNPF Power Transformation ================================================
lambdas_QC <- RPNPF_NtS_NA_filtered %>%
filter(sample_group %in% c("JMML", "Control")) %>%
group_by(compound_name) %>%
summarise(
lambda = {
model <- lm(intensity ~ 1, data = cur_data())
bc <- MASS::boxcox(model, lambda = seq(-2, 2, 0.1), plotit = FALSE)
lambda_opt <- bc$x[which.max(bc$y)]
lambda_opt
},
.groups = "drop"
)
# 2. Join λ and apply transformation only when intensity > 0
RPNPF_transformed <- RPNPF_NtS_NA_filtered %>%
filter(sample_group %in% c("JMML", "Control")) %>%
left_join(lambdas_QC, by = "compound_name") %>%
mutate(
intensity_bc = case_when(
is.na(intensity) | intensity <= 0 | is.na(lambda) ~ NA_real_,
abs(lambda) < 1e-8 ~ log(intensity),
TRUE ~ (intensity^lambda - 1) / lambda
)
)
head(RPNPF_transformed)
# ==== RPNPF Filter Coefficient of Variation =====================================
# Now compute CV for matrix and PooledQC:
cv_tbl_RPNPF <- RPNPF_transformed %>%
filter(sample_group %in% c("JMML", "Control")) %>%
group_by(compound_name, sample_group) %>%
summarise(
mean_intensity_bc = mean(intensity_bc, na.rm = TRUE),
sd_intensity_bc = sd(intensity_bc, na.rm = TRUE),
cv = sd_intensity_bc / mean_intensity_bc,
.groups = "drop"
)
# Pivot all three summary columns to wide format
cv_wide_RPNPF <- cv_tbl_RPNPF %>%
pivot_wider(
names_from = sample_group,
values_from = c(mean_intensity_bc, sd_intensity_bc, cv),
names_sep = "_"
) %>%
filter(
!is.na(cv_Control) & cv_Control < 0.3,
!is.na(cv_JMML) & cv_JMML < 0.3
)
# Filter based on conditions
cv_filtered_compounds_RPNPF <- cv_wide_RPNPF %>%
pull(compound_name)
RPNPF_final_filtered <- RPNPF_NtS_NA_filtered %>%
semi_join(tibble(compound_name = cv_filtered_compounds_RPNPF), by = "compound_name")
# ==== RPNPF Filtering Summary Table =============================================
RPNPF_filter_summary <- tibble(
step = c(
"Initial",
"After ≤65% Missingness",
"After Pooled Noise-to-Signal ≤ 0.3",
"After CV(JMML) ≤ 0.3 & CV(Control) ≤ 0.3"
),
n_metabolites = c(
n_distinct(RPNPF_long$compound_name),
n_distinct(RPNPF_NA_filtered$compound_name),
n_distinct(RPNPF_NtS_NA_filtered$compound_name),
n_distinct(RPNPF_final_filtered$compound_name)
)
)
# ==== CV for duplicate metabolites =============================================
#Are there duplicate metabolites?
shared_compounds <- intersect(
unique(RPNPF_final_filtered$compound_name),
unique(ZHP_final_filtered$compound_name)
)
length(shared_compounds) # number of overlapping compounds
shared_compounds # view the compound names
#Compare CVs for ZHP and RPNPF for duplicate metabolites
cv_wide_ZHP <- cv_wide_ZHP %>%
mutate(platform = "ZHP")
cv_wide_RPNPF <- cv_wide_RPNPF %>%
mutate(platform = "RPNPF")
cv_combined <- bind_rows(cv_wide_ZHP, cv_wide_RPNPF)
# Keep only shared compounds
shared_compounds <- intersect(cv_wide_RPNPF$compound_name, cv_wide_ZHP$compound_name)
cv_combined_shared <- cv_combined %>%
filter(compound_name %in% shared_compounds)
# pick the platform with the lower max CV across JMML and Control
cv_decision <- cv_combined_shared %>%
mutate(cv_max = pmax(cv_JMML, cv_Control, na.rm = TRUE)) %>%
group_by(compound_name) %>%
slice_min(cv_max, with_ties = FALSE) %>%
ungroup()
writexl::write_xlsx(cv_decision, "../res/annotated/cv_compare_selected.xlsx")
# Get compounds assigned to ZHP (to remove from RPNPF)
keep_ZHP <- cv_decision %>%
filter(platform == "ZHP") %>%
pull(compound_name)
# Remove those compounds from RPNPF
RPNPF_filtered_nodup <- RPNPF_final_filtered %>%
filter(!(compound_name %in% keep_ZHP))
### Same idea as above
keep_RPNPF <- cv_decision %>%
filter(platform == "RPNPF") %>%
pull(compound_name)
# Remove those compounds from RPNPF
ZHP_filtered_nodup <- ZHP_final_filtered %>%
filter(!(compound_name %in% keep_RPNPF))
# ==== Hemoglobin Table =======================================================
# Add platform labels
spearman_hgb_cor_RPNPF <- spearman_hgb_cor_RPNPF %>%
mutate(platform = "RPNPF")
spearman_hgb_cor_ZHP <- spearman_hgb_cor_ZHP %>%
mutate(platform = "ZHP")
combined_hgb_cor <- bind_rows(spearman_hgb_cor_ZHP, spearman_hgb_cor_RPNPF)
writexl::write_xlsx(combined_hgb_cor, "../res/annotated/hemoglobin_correlated_metabolites.xlsx")
# ==== Save data ==============================================================
saveRDS(ZHP_filtered_nodup, file = "../data/R_objects/ZHP_annotated_filtered.rds")
saveRDS(RPNPF_filtered_nodup, file = "../data/R_objects/RPNPF_annotated_filtered.rds")
filter_summary <- ZHP_filter_summary %>%
rename(ZHP_metabolites = n_metabolites) %>%
left_join(RPNPF_filter_summary %>% rename(RPNPF_metabolites = n_metabolites),
by = "step")
filter_summary <- bind_rows(
filter_summary,
tibble(
step = "Deduplicated shared metabolites by lowest max CV (JMML, Control)",
ZHP_metabolites = n_distinct(ZHP_filtered_nodup$compound_name),
RPNPF_metabolites = n_distinct(RPNPF_filtered_nodup$compound_name)
)
)
writexl::write_xlsx(filter_summary, "../res/annotated/filter_summary.xlsx")
print(filter_summary)# Missingness ZHP
library(naniar)
# assumes this just ran:
#source("01a_save_data_rds_lilly_bc.R")
# ==== ZHP Missingness =========================================================
# Filter and reshape
ZHP_wide <- ZHP_long %>%
filter(sample_group %in% c("JMML", "Control")) %>%
pivot_wider(
names_from = compound_name,
values_from = intensity
)
# Reorder columns to fix display (optional)
ZHP_wide <- ZHP_wide %>%
arrange(SampleID) %>%
column_to_rownames("SampleID")
#### ==== Missingness Heatmap ==============================================
p_miss_clean_ZHP <- vis_miss(ZHP_wide, sort_miss = FALSE) +
labs(
title = "ZHP Metabolite Missingness (JMML + Control)",
subtitle = "Each row = sample, each column = metabolite",
x = NULL,
y = NULL
) +
theme_bw(base_size = 11) +
theme(
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
panel.grid = element_blank(),
panel.border = element_blank(),
legend.position = "bottom"
)
# Step 4: Save
ggsave(
filename = "../res/annotated/QC/ZHP_missingness_heatmap.png",
plot = p_miss_clean_ZHP,
width = 10,
height = 6,
dpi = 300
)
# Recompute missingness
ZHP_missingness_summary <- ZHP_long %>%
filter(sample_group %in% c("JMML", "Control")) %>%
group_by(compound_name) %>%
summarise(
prop_missing_or_zero = mean(is.na(intensity) | intensity == 0),
.groups = "drop"
)
# Threshold range
thresholds <- seq(-0.01, 1, by = 0.01)
# Compute number removed at each threshold
ZHP_removed_curve <- tibble(
threshold = thresholds,
n_removed = map_int(thresholds, function(t) {
sum(ZHP_missingness_summary$prop_missing_or_zero > t)
})
)
M3_tick <- .65 + (1 - .65)/2
M2_tick <- .2 + (.65 - .2)/2
M1_tick <- .1
a <- 0.01/2
# Plot
p_removed_ZHP <- ggplot(ZHP_removed_curve, aes(x = threshold, y = n_removed)) +
geom_line(color = "black") +
geom_vline(xintercept = c(0.20, 0.65), linetype = "dashed", color = "red") +
# Region annotations
annotate("text", x = 0.10, y = max(ZHP_removed_curve$n_removed) * 0.75,
label = "T-Test on\nImputed Values", size = 4, color = "gray20") +
annotate("text", x = 0.425, y = max(ZHP_removed_curve$n_removed) * 0.5,
label = "Fisher's Exact Test\non Missingness", size = 4, color = "gray20") +
annotate("text", x = 0.825, y = max(ZHP_removed_curve$n_removed) * 0.5,
label = "Removed from Analysis", size = 4, color = "gray20") +
# Custom x-axis ticks
scale_x_continuous(
name = "Missingness Threshold",
breaks = c(0, 0.1, 0.2, 0.425, 0.65, 0.825, 1.0),
labels = c("0", "M1", "0.2", "M2", "0.65", "M3", "1.0")
) +
labs(
title = "Number of Metabolites Classified by Missingness Threshold - ZHP",
y = "Number of Metabolites Removed"
) +
theme_bw(base_size = 12)
# Save to PNG
ggsave(
"../res/annotated/QC/ZHP_missingness_threshold_excluded.png",
plot = p_removed_ZHP,
width = 7,
height = 5,
dpi = 300
)
a <- 0.01/2
# Plot
p_removed_ZHP_ttest <- ggplot(ZHP_removed_curve, aes(x = threshold, y = n_removed)) +
geom_line(color = "black") +
geom_vline(xintercept = c(0.20, 0.65), linetype = "dashed", color = "red") +
# Region annotations
annotate("text", x = 0.10, y = max(ZHP_removed_curve$n_removed) * 0.75,
label = "T-Test on\nImputed Values", size = 4, color = "gray20") +
annotate("text", x = 0.425, y = max(ZHP_removed_curve$n_removed) * 0.5,
label = "Fisher's Exact Test\non Missingness", size = 4, color = "gray20") +
annotate("text", x = 0.825, y = max(ZHP_removed_curve$n_removed) * 0.5,
label = "Removed from Analysis", size = 4, color = "gray20") +
#Color the part you want
geom_rect(aes(xmin = 0, xmax = 0.20, ymin = -Inf, ymax = Inf),
fill = "dodgerblue", alpha = a) +
# geom_rect(aes(xmin = 0.20, xmax = 0.65, ymin = -Inf, ymax = Inf),
# fill = "darkorange", alpha = a) +
# geom_rect(aes(xmin = 0.65, xmax = 1.0, ymin = -Inf, ymax = Inf),
# fill = "gray60", alpha = a) +
# Custom x-axis ticks
scale_x_continuous(
name = "Missingness Threshold",
breaks = c(0, 0.1, 0.2, 0.425, 0.65, 0.825, 1.0),
labels = c("0", "M1", "0.2", "M2", "0.65", "M3", "1.0")
) +
labs(
title = "Number of Metabolites Classified by Missingness Threshold - ZHP",
y = "Number of Metabolites Removed"
) +
theme_bw(base_size = 12)
ggsave("../res/annotated/QC/missingness_ZHP_ttest.png", plot = p_removed_ZHP_ttest)
# Plot
p_removed_ZHP_fisher <- ggplot(ZHP_removed_curve, aes(x = threshold, y = n_removed)) +
geom_line(color = "black") +
geom_vline(xintercept = c(0.20, 0.65), linetype = "dashed", color = "red") +
# Region annotations
annotate("text", x = 0.10, y = max(ZHP_removed_curve$n_removed) * 0.75,
label = "T-Test on\nImputed Values", size = 4, color = "gray20") +
annotate("text", x = 0.425, y = max(ZHP_removed_curve$n_removed) * 0.5,
label = "Fisher's Exact Test\non Missingness", size = 4, color = "gray20") +
annotate("text", x = 0.825, y = max(ZHP_removed_curve$n_removed) * 0.5,
label = "Removed from Analysis", size = 4, color = "gray20") +
#Color the part you want
# geom_rect(aes(xmin = 0, xmax = 0.20, ymin = -Inf, ymax = Inf),
# fill = "dodgerblue", alpha = a) +
geom_rect(aes(xmin = 0.20, xmax = 0.65, ymin = -Inf, ymax = Inf),
fill = "dodgerblue", alpha = a) +
# geom_rect(aes(xmin = 0.65, xmax = 1.0, ymin = -Inf, ymax = Inf),
# fill = "gray60", alpha = a) +
# Custom x-axis ticks
scale_x_continuous(
name = "Missingness Threshold",
breaks = c(0, 0.1, 0.2, 0.425, 0.65, 0.825, 1.0),
labels = c("0", "M1", "0.2", "M2", "0.65", "M3", "1.0")
) +
labs(
title = "Number of Metabolites Classified by Missingness Threshold - ZHP",
y = "Number of Metabolites Removed"
) +
theme_bw(base_size = 12)
ggsave("../res/annotated/QC/missingness_ZHP_fisher.png", plot = p_removed_ZHP_fisher)
# ==== RPNPF Missingness =========================================================
# Filter and reshape
RPNPF_wide <- RPNPF_long %>%
filter(sample_group %in% c("JMML", "Control")) %>%
pivot_wider(
names_from = compound_name,
values_from = intensity
)
# Reorder columns to fix display (optional)
RPNPF_wide <- RPNPF_wide %>%
arrange(SampleID) %>%
column_to_rownames("SampleID")
#### ==== Missingness Heatmap ==============================================
p_miss_clean_RPNPF <- vis_miss(RPNPF_wide, sort_miss = FALSE) +
labs(
title = "RPNPF Metabolite Missingness (JMML + Control)",
subtitle = "Each row = sample, each column = metabolite",
x = NULL,
y = NULL
) +
theme_bw(base_size = 11) +
theme(
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
panel.grid = element_blank(),
panel.border = element_blank(),
legend.position = "bottom"
)
# Save
ggsave(
filename = "../res/annotated/QC/RPNPF_missingness_heatmap.png",
plot = p_miss_clean_RPNPF,
width = 10,
height = 6,
dpi = 300
)
# Recompute missingness
RPNPF_missingness_summary <- RPNPF_long %>%
filter(sample_group %in% c("JMML", "Control")) %>%
group_by(compound_name) %>%
summarise(
prop_missing_or_zero = mean(is.na(intensity) | intensity == 0),
.groups = "drop"
)
# Threshold range
thresholds <- seq(-0.01, 1, by = 0.01)
# Compute number removed at each threshold
RPNPF_removed_curve <- tibble(
threshold = thresholds,
n_removed = map_int(thresholds, function(t) {
sum(RPNPF_missingness_summary$prop_missing_or_zero > t)
})
)
M3_tick <- .65 + (1 - .65)/2
M2_tick <- .2 + (.65 - .2)/2
M1_tick <- .1
a <- 0.01/2
# Plot
p_removed_RPNPF <- ggplot(RPNPF_removed_curve, aes(x = threshold, y = n_removed)) +
geom_line(color = "black") +
geom_vline(xintercept = c(0.20, 0.65), linetype = "dashed", color = "red") +
# Region annotations
annotate("text", x = 0.10, y = max(RPNPF_removed_curve$n_removed) * 0.75,
label = "T-Test on\nImputed Values", size = 4, color = "gray20") +
annotate("text", x = 0.425, y = max(RPNPF_removed_curve$n_removed) * 0.5,
label = "Fisher's Exact Test\non Missingness", size = 4, color = "gray20") +
annotate("text", x = 0.825, y = max(RPNPF_removed_curve$n_removed) * 0.5,
label = "Removed from Analysis", size = 4, color = "gray20") +
# Custom x-axis ticks
scale_x_continuous(
name = "Missingness Threshold",
breaks = c(0, 0.1, 0.2, 0.425, 0.65, 0.825, 1.0),
labels = c("0", "M1", "0.2", "M2", "0.65", "M3", "1.0")
) +
labs(
title = "Number of Metabolites Classified by Missingness Threshold - RPNPF",
y = "Number of Metabolites Removed"
) +
theme_bw(base_size = 12)
# Save to PNG
ggsave(
"../res/annotated/QC/RPNPF_missingness_threshold_excluded.png",
plot = p_removed_RPNPF,
width = 7,
height = 5,
dpi = 300
)
a <- 0.01/2
# Plot
p_removed_RPNPF_ttest <- ggplot(RPNPF_removed_curve, aes(x = threshold, y = n_removed)) +
geom_line(color = "black") +
geom_vline(xintercept = c(0.20, 0.65), linetype = "dashed", color = "red") +
# Region annotations
annotate("text", x = 0.10, y = max(RPNPF_removed_curve$n_removed) * 0.75,
label = "T-Test on\nImputed Values", size = 4, color = "gray20") +
annotate("text", x = 0.425, y = max(RPNPF_removed_curve$n_removed) * 0.5,
label = "Fisher's Exact Test\non Missingness", size = 4, color = "gray20") +
annotate("text", x = 0.825, y = max(RPNPF_removed_curve$n_removed) * 0.5,
label = "Removed from Analysis", size = 4, color = "gray20") +
#Color the part you want
geom_rect(aes(xmin = 0, xmax = 0.20, ymin = -Inf, ymax = Inf),
fill = "dodgerblue", alpha = a) +
# Custom x-axis ticks
scale_x_continuous(
name = "Missingness Threshold",
breaks = c(0, 0.1, 0.2, 0.425, 0.65, 0.825, 1.0),
labels = c("0", "M1", "0.2", "M2", "0.65", "M3", "1.0")
) +
labs(
title = "Number of Metabolites Classified by Missingness Threshold - RPNPF",
y = "Number of Metabolites Removed"
) +
theme_bw(base_size = 12)
# Plot
p_removed_RPNPF_fisher <- ggplot(RPNPF_removed_curve, aes(x = threshold, y = n_removed)) +
geom_line(color = "black") +
geom_vline(xintercept = c(0.20, 0.65), linetype = "dashed", color = "red") +
# Region annotations
annotate("text", x = 0.10, y = max(RPNPF_removed_curve$n_removed) * 0.75,
label = "T-Test on\nImputed Values", size = 4, color = "gray20") +
annotate("text", x = 0.425, y = max(RPNPF_removed_curve$n_removed) * 0.5,
label = "Fisher's Exact Test\non Missingness", size = 4, color = "gray20") +
annotate("text", x = 0.825, y = max(RPNPF_removed_curve$n_removed) * 0.5,
label = "Removed from Analysis", size = 4, color = "gray20") +
#Color the part you want
geom_rect(aes(xmin = 0.20, xmax = 0.65, ymin = -Inf, ymax = Inf),
fill = "dodgerblue", alpha = a) +
# Custom x-axis ticks
scale_x_continuous(
name = "Missingness Threshold",
breaks = c(0, 0.1, 0.2, 0.425, 0.65, 0.825, 1.0),
labels = c("0", "M1", "0.2", "M2", "0.65", "M3", "1.0")
) +
labs(
title = "Number of Metabolites Classified by Missingness Threshold - RPNPF",
y = "Number of Metabolites Removed"
) +
theme_bw(base_size = 12)The first step for our workflow is to segment our data based on missing values for each metabolite, based on on the JMML and Control groups.
Each metabolite is going to fall into one of three groups:
M1 Missing less than 20%: We are going to pass this to a T-Test, after imputing the values.
M2 Missing between 20% - 65% : We are going to apply Fisher’s exact test to assess JMML-related missingness.
M3 Missing more than 65%+: Remove the metabolite completely from the analysis.
Goal: Remove Metabolites with more than 65%+ NA values in JMML and Control Groups.
Goal: Remove Metabolites with a Noise-to-Signal Ratio greater than 0.3
To identify potential confounding by blood volume, we examine the relationship between metabolite intensity and hemoglobin concentration (mg/mL), which serves as a proxy for total blood in the dried spot.
If a metabolite’s intensity is strongly correlated with hemoglobin, it may reflect differences in sample volume rather than true biological variation. In such cases, apparent enrichment could be an artifact of having more blood rather than more of the metabolite.
We use Spearman’s correlation to assess this relationship, as it captures monotonic trends without assuming linearity or normal distributions—ideal for high-variance metabolomics data.
Values near zero indicate little association with hemoglobin and are unlikely to be affected by blood volume. Positive ρ values suggest metabolite intensity rises with hemoglobin, potentially reflecting volume rather than biology.
The distribution is centered slightly above 0, which suggests some metabolites track with modestly hemoglobin, such as:
## # A tibble: 1 × 4
## compound_name n rho platform
## <chr> <dbl> <dbl> <chr>
## 1 18:2 PE (1,2-dilinoleoyl-sn-glycero-3-phosphoethanolamin… 116 0.554 RPNPF
However, we will leave it in for the analysis.
Goal: Remove Metabolites with a Coefficient of Variation greater than 0.3.
The coefficient of variation (CV) is a way to measure how much a set of data varies compared to its average value. It’s calculated by dividing the standard deviation (how spread out the values are) by the mean (the average value), and then multiplying by 100 to get a percentage.
If the CV is high, it means there’s a lot of variation relative to the average.
If the CV is low, it means the values are pretty consistent compared to the average.
Coefficient of Variation assumes a normal distribution. We are going to transform our metabolite intensities to a normal distribution using the box-cox transform. We will use the box-cox intensities to filter by the coefficient of variation.
To avoid confounding due to group-specific noise, we filter within each group (JMML and Control) and retain only metabolites that are stable in both.
# 05_cv_distribution_viz.R
# ─────────────────────────
# Visualize CV of Box–Cox–transformed intensities, showing kept vs filtered-out compounds
# assumes this just ran:
#source("01a_save_data_rds_lilly_bc.R")
# ==== ZHP =========================================
# ==== Plot Coefficient of Variation =========================================
# Create a histogram to visualize the distribution of CV values
p_cv_hist_JMML_ZHP <- ggplot(cv_tbl_ZHP %>% filter(sample_group == "JMML"), aes(x = cv)) +
geom_histogram(binwidth = 0.01, fill = "#E74C3C", color = "black", alpha = 0.8) +
geom_vline(xintercept = 0.3, linetype = "dashed", color = "blue", size = 1) + # Vertical line at CV = 0.3
labs(title = "Histogram of Coefficient of Variation (CV) Before Filtering",
subtitle = "JMML Samples | Box-Cox Transformed | ZHP",
x = "Coefficient of Variation (CV)",
y = "Count") +
theme_bw()
# Save the histogram
ggsave(
filename = "../res/annotated/QC/histogram_cv_before_filtering_JMML_ZHP.png",
plot = p_cv_hist_JMML_ZHP,
width = 7,
height = 5,
units = "in",
dpi = 300
)
# ==== Plot Coefficient of Variation =========================================
# Create a histogram to visualize the distribution of CV values
p_cv_hist_Control_ZHP <- ggplot(cv_tbl_ZHP %>% filter(sample_group == "Control"), aes(x = cv)) +
geom_histogram(binwidth = 0.01, fill = "#3498DB", color = "black", alpha = 0.8) +
geom_vline(xintercept = 0.3, linetype = "dashed", color = "blue", size = 1) + # Vertical line at CV = 0.3
labs(title = "Histogram of Coefficient of Variation (CV) Before Filtering",
subtitle = "Control Samples | Box-Cox Transformed | ZHP",
x = "Coefficient of Variation (CV)",
y = "Count") +
theme_bw()
# Save the histogram
ggsave(
filename = "../res/annotated/QC/histogram_cv_before_filtering_Control_ZHP.png",
plot = p_cv_hist_Control_ZHP,
width = 7,
height = 5,
units = "in",
dpi = 300
)
## ==== RPNPF =========================================
# ==== Plot Coefficient of Variation =========================================
# Create a histogram to visualize the distribution of CV values
p_cv_hist_JMML_RPNPF <- ggplot(cv_tbl_RPNPF %>% filter(sample_group == "JMML"), aes(x = cv)) +
geom_histogram(binwidth = 0.01, fill = "#E74C3C", color = "black", alpha = 0.8) +
geom_vline(xintercept = 0.3, linetype = "dashed", color = "blue", size = 1) + # Vertical line at CV = 0.3
labs(title = "Histogram of Coefficient of Variation (CV) Before Filtering",
subtitle = "JMML Samples | Box-Cox Transformed | RPNPF",
x = "Coefficient of Variation (CV)",
y = "Count") +
theme_bw()
# Save the histogram
ggsave(
filename = "../res/annotated/QC/histogram_cv_before_filtering_JMML_RPNPF.png",
plot = p_cv_hist_JMML_RPNPF,
width = 7,
height = 5,
units = "in",
dpi = 300
)
# ==== Plot Coefficient of Variation =========================================
# Create a histogram to visualize the distribution of CV values
p_cv_hist_Control_RPNPF <- ggplot(cv_tbl_RPNPF %>% filter(sample_group == "Control"), aes(x = cv)) +
geom_histogram(binwidth = 0.01, fill = "#3498DB", color = "black", alpha = 0.8) +
geom_vline(xintercept = 0.3, linetype = "dashed", color = "blue", size = 1) + # Vertical line at CV = 0.3
labs(title = "Histogram of Coefficient of Variation (CV) Before Filtering",
subtitle = "Control Samples | Box-Cox Transformed | RPNPF",
x = "Coefficient of Variation (CV)",
y = "Count") +
theme_bw()
# Save the histogram
ggsave(
filename = "../res/annotated/QC/histogram_cv_before_filtering_Control_RPNPF.png",
plot = p_cv_hist_Control_RPNPF,
width = 7,
height = 5,
units = "in",
dpi = 300
)We have 23 metabolites detected by both ZHP and RPNPF.
We will compare each shared metabolite’s CV for both JMML and Control groups.
We will choose the platform the the lower maximum CV across the JMML and Control groups.
This ensures each metabolite is represented only once in downstream analyses, with preference given to the platform offering lower within-group variability.
# ==============================================================================
# Script Name: fishers.R (refactor 2025-06-20)
# Author: Steven Brooks
# Purpose: Fisher’s exact test on compounds with 20–65 % missingness
# ==============================================================================
# ---- Setup -------------------------------------------------------------------
library(tidyverse)
library(writexl)
# ---- IO helpers --------------------------------------------------------------
run_fisher <- function(data,
dataset_name,
min_miss = 0.20,
max_miss = 0.65,
out_path = "../res/annotated") {
message("» Running Fisher tests for ", dataset_name)
# ensure output directory exists
if (!dir.exists(out_path)) {
dir.create(out_path, recursive = TRUE)
}
## 1. keep JMML & Control and tag missingness -------------------------------
miss_tbl <- data %>%
filter(sample_group %in% c("JMML", "Control")) %>%
mutate(missing = if_else(is.na(intensity) | intensity == 0, "NA", "Present"))
## 2. filter compounds in given missingness window --------------------------
keep_vec <- miss_tbl %>%
group_by(compound_name) %>%
summarise(prop_missing = mean(missing == "NA"), .groups = "drop") %>%
filter(between(prop_missing, min_miss, max_miss)) %>%
pull(compound_name)
miss_tbl <- miss_tbl %>% filter(compound_name %in% keep_vec)
## 3. count → wide (JMML_NA etc.) -------------------------------------------
fisher_df <- miss_tbl %>%
count(compound_name, sample_group, missing) %>%
pivot_wider(
names_from = c(sample_group, missing),
values_from = n,
values_fill = 0,
names_sep = "_") %>%
# ensure columns exist even if all 0
rowwise() %>%
mutate(
## 4. define a, b, c, d ---------------------------------------------------
a_Control_NA = Control_NA,
b_Control_Present = Control_Present,
c_JMML_NA = JMML_NA,
d_JMML_Present = JMML_Present,
## 5. Fisher test --------------------------------------------------------
fisher_result = list(fisher.test(matrix(c(a_Control_NA,
b_Control_Present,
c_JMML_NA,
d_JMML_Present),
nrow = 2, byrow = TRUE,
dimnames = list(
Group = c("Control","JMML"),
Status = c("Missing","Present")
)))),
odds_ratio = fisher_result$estimate[[1]],
ci_lower = fisher_result$conf.int[1],
ci_upper = fisher_result$conf.int[2],
fisher_pval = fisher_result$p.value,
log2_OR = log2(odds_ratio)
) %>%
ungroup() %>%
select(compound_name, a_Control_NA,
b_Control_Present,
c_JMML_NA,
d_JMML_Present,
odds_ratio, ci_lower, ci_upper, log2_OR, fisher_pval)
## 6. save -------------------------------------------------------------------
out_file <- file.path(out_path,
paste0(dataset_name, "_fisher.xlsx"))
writexl::write_xlsx(fisher_df, out_file)
message("✓ Results written to: ", out_file)
invisible(fisher_df)
}
# ---- Load data ---------------------------------------------------------------
ZHP <- readRDS("../data/R_objects/ZHP_annotated_filtered.rds")
RPNPF <- readRDS("../data/R_objects/RPNPF_annotated_filtered.rds")
# ---- Run ---------------------------------------------------------------------
fisher_results_ZHP <- run_fisher(ZHP, "ZHP")
fisher_results_RPNPF <- run_fisher(RPNPF, "RPNPF")Next, we apply Fisher’s exact test betweem JMML and Control groups on the 20% - 65% Missing Metabolties. No imputation is performed on these metabolites. We wish to identify differential missingness patterns based on JMML status.
Let’s walk through what we are doing by looking closer at 2-Oxobutanoic acid:
Contingency table for 2-Oxobutanoic acid detection
| Absent (NA) | Present | |
|---|---|---|
| Control | 42 | 16 |
| JMML | 9 | 49 |
Detection frequencies
Odds ratio (OR)
OR = (49/9) / (16/42) ≈ 14.3
This is the ratio of the odds of detection in JMML versus Control.
Log₂-transformed OR log2(14.3) ≈ 3.79
A positive value means the metabolite is detected more often in JMML. Specifically, a log₂(OR) of 3.79 corresponds to an OR ≈ 14, so the odds of seeing 2-Oxobutanoic acid are roughly fourteen‐fold higher in JMML samples than in controls.
Table Legend
compound_name: Name of the metabolite
effect: Effect size (e.g., odds ratio for Fisher’s test or t-statistic for t-tests)
raw_p: Unadjusted p-value from the statistical test
platform: Analytical platform used (ZHP or RPNPF)
test_type: Type of test performed (Fisher for presence/absence, TTest for abundance)
a_Control_NA: # of control samples with missing/zero values
b_Control_Present: # of control samples with detected values
c_JMML_NA: # of JMML samples with missing/zero values
d_JMML_Present: # of JMML samples with detected values
ci_lower, ci_upper: Lower and upper bounds of the 95% confidence interval for the odds ratio
log2_OR: Log₂-transformed odds ratio (positive = enriched in JMML; negative = enriched in Control)
global_fdr: Adjusted p-value using Benjamini-Hochberg correction
direction: Interpretation of enrichment (e.g., "Detected more in JMML")
neglog10_p: –log₁₀(raw p-value), useful for visualizing significance
sig: Significance status after FDR correction
We apply QRLIC imputation (Quantile Regression Imputation of Left-Censored data) on the metabolites with less than 20% missingness in the combined JMML and Control groups.
# ==============================================================================
# Script Name: QLIC_imputation.R
# Author: Steven Brooks
# Date: 2025-06-04
# Description: Perform QLIC on the least 20% missing values.
# The idea is to:
# 1) log2(x+1) the raw intensities.
# 2) Perform QRLIC on the log-transformed intensities.
# 3) Invert the log transformation on the imputed intensities (2^QRLIC(x))
# 4) Box-cox the 'raw' imputed intensities.
### Also, this script assumes both RPNPF and ZHP do not have zero values.
# We have to change the logic for the unannotated analysis.
# ==============================================================================
# ==== Setup ===================================================================
library(tidyverse)
library(readxl)
library(imputeLCMD)
library(writexl)
# ==== Load data ===============================================================
ZHP <- readRDS(file = "../data/R_objects/ZHP_annotated_filtered.rds")
RPNPF <- readRDS(file = "../data/R_objects/RPNPF_annotated_filtered.rds")
# ==== Process ZHP =============================================================
## ==== QRILC imputation on compounds with <20% missingness =====================
ZHP_filtered <- ZHP %>%
filter(sample_group %in% c("JMML", "Control")) %>%
mutate(sample_group = factor(sample_group, levels = c("Control", "JMML")))
compound_missingness <- ZHP_filtered %>%
group_by(compound_name) %>%
summarize(
total = n(),
zeros = sum(intensity == 0, na.rm = TRUE),
NAs = sum(is.na(intensity)),
missing = sum(is.na(intensity) | intensity == 0),
missing_pct = missing / total
) %>%
filter(missing_pct < 0.2)
ZHP_clean <- ZHP_filtered %>%
filter(compound_name %in% compound_missingness$compound_name)
ZHP_wide <- ZHP_clean %>%
select(SampleID, compound_name, intensity) %>%
mutate(
intensity_log2 = if_else(
is.na(intensity) | intensity == 0,
NA_real_,
log2(intensity + 1)
)
) %>%
select(SampleID, compound_name, intensity_log2) %>%
pivot_wider(names_from = compound_name, values_from = intensity_log2)
# Drop SampleID for imputation
intensity_matrix <- ZHP_wide %>% select(-SampleID) %>% as.matrix()
# Apply QRLIC imputation
set.seed(123)
imputed_matrix <- impute.QRILC(intensity_matrix)[[1]]
# Set any negative imputed values to zero
imputed_matrix[imputed_matrix < 0] <- 0
# Combine back with SampleID
ZHP_imputed <- bind_cols(SampleID = ZHP_wide$SampleID, as.data.frame(imputed_matrix))
inv_ZHP_imputed <- ZHP_imputed %>%
mutate(across(-SampleID, ~ 2^.x - 1))
## ==== Box-Cox Transform ZHP =================================================
# Prepare sample group info
group_info <- ZHP_filtered %>%
filter(SampleID %in% ZHP_imputed$SampleID) %>%
distinct(SampleID, sample_group)
# Join group info to imputed data
imputed_long_ZHP <- inv_ZHP_imputed %>%
pivot_longer(-SampleID, names_to = "compound_name", values_to = "intensity") %>%
left_join(group_info, by = "SampleID") %>%
filter(sample_group %in% c("JMML", "Control")) %>%
mutate(sample_group = factor(sample_group, levels = c("Control", "JMML")))
# Box-cox transform before t.test:
lambdas <- imputed_long_ZHP %>%
group_by(compound_name) %>%
summarise(
lambda = {
model <- lm(intensity ~ 1, data = cur_data())
bc <- MASS::boxcox(model, lambda = seq(-2, 2, 0.1), plotit = FALSE)
lambda_opt <- bc$x[which.max(bc$y)]
lambda_opt
},
.groups = "drop"
)
# 2. Join λ and apply transformation only when intensity > 0
ZHP_transformed <- imputed_long_ZHP %>%
left_join(lambdas, by = "compound_name") %>%
mutate(
intensity_bc = case_when(
is.na(intensity) | intensity <= 0 | is.na(lambda) ~ NA_real_,
abs(lambda) < 1e-8 ~ log(intensity),
TRUE ~ (intensity^lambda - 1) / lambda
)
)
#Save the imputed raw intensities, log2() intensities, and box-cox intensities
ZHP_transformed <- ZHP_transformed %>%
mutate(intensity_log2 = log2(intensity + 1))
## ==== t.test ZHP ============================================================
# Calculate means by group
group_means <- ZHP_transformed %>%
group_by(compound_name, sample_group) %>%
summarize(mean_intensity = mean(intensity_log2), .groups = "drop") %>%
pivot_wider(names_from = sample_group, values_from = mean_intensity, names_prefix = "mean_")
# Run t-tests
ttest_results_ZHP <- ZHP_transformed %>%
group_by(compound_name) %>%
summarize(
t_test = list(t.test(intensity_bc ~ sample_group)),
.groups = "drop"
) %>%
mutate(
t_statistic = map_dbl(t_test, ~ .x$statistic),
p_value = map_dbl(t_test, ~ .x$p.value)
) %>%
select(compound_name, t_statistic, p_value) %>%
left_join(group_means, by = "compound_name") %>%
mutate(log2_fc = mean_JMML - mean_Control)
## ------- ZHP PCA Before imputation --------------------------------------------
# PCA Validation of Normalization and Imputation Procedures
#
# PCA is performed on both unimputed and imputed datasets to assess the impact
# of data preprocessing on overall structure and variance.
#
# Step 1: Normalize the unimputed data
# - Normalization is applied to raw intensity values (e.g., log2, Box-Cox).
# - PCA is run with samples as rows and metabolites as columns.
# - This serves as a baseline before any imputation is applied.
# Box-cox transform before t.test:
lambdas_unimp <- ZHP_clean %>%
group_by(compound_name) %>%
summarise(
lambda = {
model <- lm(intensity ~ 1, data = cur_data())
bc <- MASS::boxcox(model, lambda = seq(-2, 2, 0.1), plotit = FALSE)
bc$x[which.max(bc$y)]
},
.groups = "drop"
)
# 2. Join λ and apply Box–Cox only when intensity > 0
ZHP_unimp_transformed <- ZHP_clean %>%
left_join(lambdas_unimp, by = "compound_name") %>%
mutate(
intensity_bc = case_when(
# missing or non-positive intensities → NA
is.na(intensity) | intensity <= 0 | is.na(lambda) ~ NA_real_,
# λ ≈ 0 → log transform
abs(lambda) < 1e-8 ~ log(intensity),
# otherwise Box–Cox
TRUE ~ (intensity^lambda - 1) / lambda
)
)
# Step 2: Impute missing values
# - Missing values are imputed using [insert method, e.g., KNN or QRILC].
# Step 3: Normalize the imputed data (if required by method)
# - Normalization is re-applied if the imputation method alters data scale.
# See above code: QLIC + Box-Cox
head(ZHP_transformed)
# Step 4: PCA on the imputed dataset
# - PCA is rerun on the normalized, imputed data.
# 1. Pivot to wide, then make SampleID the rownames before dropping metadata
ZHP_unimp_pca_mat <- ZHP_unimp_transformed %>%
select(SampleID, sample_group, compound_name, intensity_bc) %>%
pivot_wider(
names_from = compound_name,
values_from = intensity_bc
) %>%
column_to_rownames(var = "SampleID") %>% # ← move SampleID into rownames
select(-sample_group) %>% # ← drop sample_group once rownames set
as.matrix()
# 2. Clean columns with any NA or infinite
keep_cols <- colSums(is.na(ZHP_unimp_pca_mat) | is.infinite(ZHP_unimp_pca_mat)) == 0
pca_unimp_matrix_clean <- ZHP_unimp_pca_mat[, keep_cols]
# 3. Run PCA
pca_unimp_res <- prcomp(pca_unimp_matrix_clean, center = TRUE, scale. = TRUE)
# 4. Extract scores (they’ll carry rownames = your SampleIDs)
pca_scores_unimp <- as.data.frame(pca_unimp_res$x) %>%
rownames_to_column(var = "SampleID")
# 5. Re-attach sample_group (if you need it for plotting)
pca_scores_unimp <- pca_scores_unimp %>%
left_join(
ZHP_unimp_transformed %>%
distinct(SampleID, sample_group),
by = "SampleID"
)
p_before_ZHP <- ggplot(pca_scores_unimp, aes(PC1, PC2, color = sample_group)) +
geom_point(size = 3, alpha = 0.8) +
labs(
x = paste0("PC1 (", round(summary(pca_unimp_res)$importance[2,1]*100,1), "%)"),
y = paste0("PC2 (", round(summary(pca_unimp_res)$importance[2,2]*100,1), "%)"),
color = "Sample Group",
title = "PCA Before QLIC Imputation\non Box-Cox Transformed Intensities (ZHP)"
) +
theme_minimal(base_size = 12)
print(p_before_ZHP)
if (!dir.exists("../res/annotated/QC")) dir.create("../res/annotated/QC", recursive = TRUE)
# 5. Save plot to PNG
ggsave(
filename = "../res/annotated/QC/PCA_boxcox_before_imputation_ZHP.png",
plot = p_before_ZHP,
width = 6,
height = 5,
dpi = 300
)
# Purpose:
# - Compare clustering and variance explained (PC1, PC2, etc.) before and after imputation.
# - Assess whether group structure (e.g., JMML vs. Control) is preserved or clarified.
# - Ensure imputation does not introduce artificial patterns or mask biological signal.
## ------- ZHP PCA After imputation ---------------------------------------------
# Pivot to wide, then make SampleID the rownames before dropping metadata
ZHP_imp_pca_mat <- ZHP_transformed %>%
select(SampleID, sample_group, compound_name, intensity_bc) %>%
pivot_wider(
names_from = compound_name,
values_from = intensity_bc
) %>%
column_to_rownames(var = "SampleID") %>% # ← move SampleID into rownames
select(-sample_group) %>% # ← drop sample_group once rownames set
as.matrix()
# Clean columns with any NA or infinite
keep_cols <- colSums(is.na(ZHP_imp_pca_mat) | is.infinite(ZHP_imp_pca_mat)) == 0
pca_imp_matrix_clean <- ZHP_imp_pca_mat[, keep_cols]
# Run PCA
pca_imp_res <- prcomp(pca_imp_matrix_clean, center = TRUE, scale. = TRUE)
# 4. Extract scores (they’ll carry rownames = your SampleIDs)
pca_scores_imp <- as.data.frame(pca_imp_res$x) %>%
rownames_to_column(var = "SampleID")
# 5. Re-attach sample_group (if you need it for plotting)
pca_scores_imp <- pca_scores_imp %>%
left_join(
ZHP_transformed %>%
distinct(SampleID, sample_group),
by = "SampleID"
)
p_after_ZHP <- ggplot(pca_scores_imp, aes(PC1, PC2, color = sample_group)) +
geom_point(size = 3, alpha = 0.8) +
labs(
x = paste0("PC1 (", round(summary(pca_imp_res)$importance[2,1]*100,1), "%)"),
y = paste0("PC2 (", round(summary(pca_imp_res)$importance[2,2]*100,1), "%)"),
color = "Sample Group",
title = "PCA After QLIC Imputation\nand Box-Cox Transformation Intensities (ZHP)"
) +
theme_minimal(base_size = 12)
print(p_after_ZHP)
# 5. Save plot to PNG
ggsave(
filename = "../res/annotated/QC/PCA_boxcox_after_imputation_ZHP.png",
plot = p_after_ZHP,
width = 6,
height = 5,
dpi = 300
)
# ==== Process RPNPF =============================================================
## ==== QRILC imputation on compounds with <20% missingness =====================
RPNPF_filtered <- RPNPF %>%
filter(sample_group %in% c("JMML", "Control")) %>%
mutate(sample_group = factor(sample_group, levels = c("Control", "JMML")))
compound_missingness <- RPNPF_filtered %>%
group_by(compound_name) %>%
summarize(
total = n(),
zeros = sum(intensity == 0, na.rm = TRUE),
NAs = sum(is.na(intensity)),
missing = sum(is.na(intensity) | intensity == 0),
missing_pct = missing / total
) %>%
filter(missing_pct < 0.2)
RPNPF_clean <- RPNPF_filtered %>%
filter(compound_name %in% compound_missingness$compound_name)
RPNPF_wide <- RPNPF_clean %>%
select(SampleID, compound_name, intensity) %>%
mutate(
intensity_log2 = if_else(
is.na(intensity) | intensity == 0,
NA_real_,
log2(intensity + 1)
)
) %>%
select(SampleID, compound_name, intensity_log2) %>%
pivot_wider(names_from = compound_name, values_from = intensity_log2)
# Drop SampleID for imputation
intensity_matrix <- RPNPF_wide %>% select(-SampleID) %>% as.matrix()
# Apply QRLIC imputation
set.seed(123)
imputed_matrix <- impute.QRILC(intensity_matrix)[[1]]
# Set any negative imputed values to zero
imputed_matrix[imputed_matrix < 0] <- 0
# Combine back with SampleID
RPNPF_imputed <- bind_cols(SampleID = RPNPF_wide$SampleID, as.data.frame(imputed_matrix))
inv_RPNPF_imputed <- RPNPF_imputed %>%
mutate(across(-SampleID, ~ 2^.x - 1))
## ==== Box-Cox Transform RPNPF =================================================
# Prepare sample group info
group_info <- RPNPF_filtered %>%
filter(SampleID %in% RPNPF_imputed$SampleID) %>%
distinct(SampleID, sample_group)
# Join group info to imputed data
imputed_long_RPNPF <- inv_RPNPF_imputed %>%
pivot_longer(-SampleID, names_to = "compound_name", values_to = "intensity") %>%
left_join(group_info, by = "SampleID") %>%
filter(sample_group %in% c("JMML", "Control")) %>%
mutate(sample_group = factor(sample_group, levels = c("Control", "JMML")))
# Box-cox transform before t.test:
lambdas <- imputed_long_RPNPF %>%
group_by(compound_name) %>%
summarise(
lambda = {
model <- lm(intensity ~ 1, data = cur_data())
bc <- MASS::boxcox(model, lambda = seq(-2, 2, 0.1), plotit = FALSE)
lambda_opt <- bc$x[which.max(bc$y)]
lambda_opt
},
.groups = "drop"
)
# 2. Join λ and apply transformation only when intensity > 0
RPNPF_transformed <- imputed_long_RPNPF %>%
left_join(lambdas, by = "compound_name") %>%
mutate(
intensity_bc = case_when(
is.na(intensity) | intensity <= 0 | is.na(lambda) ~ NA_real_,
abs(lambda) < 1e-8 ~ log(intensity),
TRUE ~ (intensity^lambda - 1) / lambda
)
)
#Save the imputed raw intensities, log2() intensities, and box-cox intensities
RPNPF_transformed <- RPNPF_transformed %>%
mutate(intensity_log2 = log2(intensity + 1))
## ==== t.test RPNPF ============================================================
# Calculate means by group
group_means <- RPNPF_transformed %>%
group_by(compound_name, sample_group) %>%
summarize(mean_intensity = mean(intensity_log2), .groups = "drop") %>%
pivot_wider(names_from = sample_group, values_from = mean_intensity, names_prefix = "mean_")
# Run t-tests
ttest_results_RPNPF <- RPNPF_transformed %>%
group_by(compound_name) %>%
summarize(
t_test = list(t.test(intensity_bc ~ sample_group)),
.groups = "drop"
) %>%
mutate(
t_statistic = map_dbl(t_test, ~ .x$statistic),
p_value = map_dbl(t_test, ~ .x$p.value)
) %>%
select(compound_name, t_statistic, p_value) %>%
left_join(group_means, by = "compound_name") %>%
mutate(log2_fc = mean_JMML - mean_Control)
## ------- RPNPF PCA Before imputation --------------------------------------------
# PCA Validation of Normalization and Imputation Procedures
#
# PCA is performed on both unimputed and imputed datasets to assess the impact
# of data preprocessing on overall structure and variance.
#
# Step 1: Normalize the unimputed data
# - Normalization is applied to raw intensity values (e.g., log2, Box-Cox).
# - PCA is run with samples as rows and metabolites as columns.
# - This serves as a baseline before any imputation is applied.
# Box-cox transform before t.test:
lambdas_unimp <- RPNPF_clean %>%
group_by(compound_name) %>%
summarise(
lambda = {
model <- lm(intensity ~ 1, data = cur_data())
bc <- MASS::boxcox(model, lambda = seq(-2, 2, 0.1), plotit = FALSE)
bc$x[which.max(bc$y)]
},
.groups = "drop"
)
# 2. Join λ and apply Box–Cox only when intensity > 0
RPNPF_unimp_transformed <- RPNPF_clean %>%
left_join(lambdas_unimp, by = "compound_name") %>%
mutate(
intensity_bc = case_when(
# missing or non-positive intensities → NA
is.na(intensity) | intensity <= 0 | is.na(lambda) ~ NA_real_,
# λ ≈ 0 → log transform
abs(lambda) < 1e-8 ~ log(intensity),
# otherwise Box–Cox
TRUE ~ (intensity^lambda - 1) / lambda
)
)
# Step 2: Impute missing values
# - Missing values are imputed using [insert method, e.g., KNN or QRILC].
# Step 3: Normalize the imputed data (if required by method)
# - Normalization is re-applied if the imputation method alters data scale.
# See above code: QLIC + Box-Cox
head(RPNPF_transformed)
# Step 4: PCA on the imputed dataset
# - PCA is rerun on the normalized, imputed data.
# 1. Pivot to wide, then make SampleID the rownames before dropping metadata
RPNPF_unimp_pca_mat <- RPNPF_unimp_transformed %>%
select(SampleID, sample_group, compound_name, intensity_bc) %>%
pivot_wider(
names_from = compound_name,
values_from = intensity_bc
) %>%
column_to_rownames(var = "SampleID") %>% # ← move SampleID into rownames
select(-sample_group) %>% # ← drop sample_group once rownames set
as.matrix()
# 2. Clean columns with any NA or infinite
keep_cols <- colSums(is.na(RPNPF_unimp_pca_mat) | is.infinite(RPNPF_unimp_pca_mat)) == 0
pca_unimp_matrix_clean <- RPNPF_unimp_pca_mat[, keep_cols]
# 3. Run PCA
pca_unimp_res <- prcomp(pca_unimp_matrix_clean, center = TRUE, scale. = TRUE)
# 4. Extract scores (they’ll carry rownames = your SampleIDs)
pca_scores_unimp <- as.data.frame(pca_unimp_res$x) %>%
rownames_to_column(var = "SampleID")
# 5. Re-attach sample_group (if you need it for plotting)
pca_scores_unimp <- pca_scores_unimp %>%
left_join(
RPNPF_unimp_transformed %>%
distinct(SampleID, sample_group),
by = "SampleID"
)
p_before_RPNPF <- ggplot(pca_scores_unimp, aes(PC1, PC2, color = sample_group)) +
geom_point(size = 3, alpha = 0.8) +
labs(
x = paste0("PC1 (", round(summary(pca_unimp_res)$importance[2,1]*100,1), "%)"),
y = paste0("PC2 (", round(summary(pca_unimp_res)$importance[2,2]*100,1), "%)"),
color = "Sample Group",
title = "PCA Before QLIC Imputation\non Box-Cox Transformed Intensities (RPNPF)"
) +
theme_minimal(base_size = 12)
print(p_before_RPNPF)
# 5. Save plot to PNG
ggsave(
filename = "../res/annotated/QC/PCA_boxcox_before_imputation_RPNPF.png",
plot = p_before_RPNPF,
width = 6,
height = 5,
dpi = 300
)
# Purpose:
# - Compare clustering and variance explained (PC1, PC2, etc.) before and after imputation.
# - Assess whether group structure (e.g., JMML vs. Control) is preserved or clarified.
# - Ensure imputation does not introduce artificial patterns or mask biological signal.
## ------- RPNPF PCA After imputation ---------------------------------------------
# Pivot to wide, then make SampleID the rownames before dropping metadata
RPNPF_imp_pca_mat <- RPNPF_transformed %>%
select(SampleID, sample_group, compound_name, intensity_bc) %>%
pivot_wider(
names_from = compound_name,
values_from = intensity_bc
) %>%
column_to_rownames(var = "SampleID") %>% # ← move SampleID into rownames
select(-sample_group) %>% # ← drop sample_group once rownames set
as.matrix()
# Clean columns with any NA or infinite
keep_cols <- colSums(is.na(RPNPF_imp_pca_mat) | is.infinite(RPNPF_imp_pca_mat)) == 0
pca_imp_matrix_clean <- RPNPF_imp_pca_mat[, keep_cols]
# Run PCA
pca_imp_res <- prcomp(pca_imp_matrix_clean, center = TRUE, scale. = TRUE)
# 4. Extract scores (they’ll carry rownames = your SampleIDs)
pca_scores_imp <- as.data.frame(pca_imp_res$x) %>%
rownames_to_column(var = "SampleID")
# 5. Re-attach sample_group (if you need it for plotting)
pca_scores_imp <- pca_scores_imp %>%
left_join(
RPNPF_transformed %>%
distinct(SampleID, sample_group),
by = "SampleID"
)
p_after_RPNPF <- ggplot(pca_scores_imp, aes(PC1, PC2, color = sample_group)) +
geom_point(size = 3, alpha = 0.8) +
labs(
x = paste0("PC1 (", round(summary(pca_imp_res)$importance[2,1]*100,1), "%)"),
y = paste0("PC2 (", round(summary(pca_imp_res)$importance[2,2]*100,1), "%)"),
color = "Sample Group",
title = "PCA After QLIC Imputation\nand Box-Cox Transformation Intensities (RPNPF)"
) +
theme_minimal(base_size = 12)
print(p_after_RPNPF)
# 5. Save plot to PNG
ggsave(
filename = "../res/annotated/QC/PCA_boxcox_after_imputation_RPNPF.png",
plot = p_after_RPNPF,
width = 6,
height = 5,
dpi = 300
)
# ==== Combining Above =======================================================
#Load fisher:
#source("02_fishers.R")
# Add platform and test labels
fisher_results_ZHP <- fisher_results_ZHP %>%
mutate(platform = "ZHP", test_type = "Fisher")
fisher_results_RPNPF <- fisher_results_RPNPF %>%
mutate(platform = "RPNPF", test_type = "Fisher")
ttest_results_ZHP <- ttest_results_ZHP %>%
mutate(platform = "ZHP", test_type = "TTest")
ttest_results_RPNPF <- ttest_results_RPNPF %>%
mutate(platform = "RPNPF", test_type = "TTest")
# Rename t-test p-values to match Fisher column names
ttest_results_ZHP <- ttest_results_ZHP %>%
rename(effect = t_statistic, raw_p = p_value)
ttest_results_RPNPF <- ttest_results_RPNPF %>%
rename(effect = t_statistic, raw_p = p_value)
fisher_results_ZHP <- fisher_results_ZHP %>%
rename(effect = odds_ratio, raw_p = fisher_pval)
fisher_results_RPNPF <- fisher_results_RPNPF %>%
rename(effect = odds_ratio, raw_p = fisher_pval)
# Combine all
combined_results <- bind_rows(
ttest_results_ZHP,
ttest_results_RPNPF,
fisher_results_ZHP,
fisher_results_RPNPF
)
# Global FDR adjustment
combined_results <- combined_results %>%
mutate(global_fdr = p.adjust(raw_p, method = "BH"))
print(combined_results %>% filter(global_fdr < .05))
write_xlsx(combined_results, "../res/annotated/final_results_global_fdr.xlsx")
saveRDS(ZHP_transformed, "../data/R_objects/ZHP_imputed.rds")
saveRDS(RPNPF_transformed, "../data/R_objects/RPNPF_imputed.rds")QRILC is a method used to fill in missing values in metabolomics data when a compound is present but too low to be detected by the machine. It estimates realistic low values using quantile regression, which preserves the distribution of the data.
We apply a log2(x + 1) transformation on the raw intensities before applying QRLIC. We do this to stabilize variance, reduce skewness, and make the data more normally distributed, which improves the performance of the quantile regression model used in QRILC.
We set any negative values after imputation to 0.
We use PCA plots (using Box-Cox transformed intensities) to validate our imputation.
What we are looking for in the PCA plots is that the overall clustering of the sample groups does not change dramatically, and that no outliers are introduced.
Imputation seems to have worked fine for both ZHP and RPNPF.
After imputation and a Box-Cox transformation, we apply a two-sided t-test between JMML and Control for each metabolite in the 20%< missingness cohort.
library(tidyverse)
library(readxl)
library(ggrepel)
# ==== Load data ===============================================================
#source("03_QLIC_imputation.R")
#assumes the objects in 03_QLIC are in the environment
final_results <- read_xlsx("../res/annotated/final_results_global_fdr.xlsx")
# ==== ZHP Visualizations ======================================================
# ==== FDR Visualizations ======================================================
# ==== Viz Fish FDR ============================================================
fish_ZHP <- final_results %>%
filter(test_type == "Fisher") %>%
filter(platform == "ZHP")
# Add enrichment direction labels
fish_plot_ZHP_fdr <- fish_ZHP %>%
mutate(
direction = case_when(
is.na(log2_OR) ~ "Ambiguous",
log2_OR >= 1 ~ "Detected more in JMML",
log2_OR <= -1 ~ "Detected more in Control",
abs(log2_OR) < 1 ~ "Ambiguous"
),
neglog10_fdr = -log10(global_fdr),
sig = ifelse(global_fdr < 0.05, "Significant", "Not Significant")) %>%
mutate(sig = factor(sig, levels = c("Not Significant", "Significant")))
fish_p_ZHP_fdr <- ggplot(fish_plot_ZHP_fdr, aes(x = log2_OR, y = neglog10_fdr)) +
geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "gray") +
geom_vline(xintercept = c(-1, 1), linetype = "dotted", color = "gray") +
geom_point(aes(color = direction, shape = sig), alpha = 0.7) +
geom_text_repel(data = fish_plot_ZHP_fdr %>% filter(sig == "Significant"),
aes(label = compound_name),
size = 3, max.overlaps = 15) +
scale_color_manual(values = c(
"Detected more in JMML" = "#E74C3C",
"Detected more in Control" = "#3498DB",
"Ambiguous" = "gray50"
)) +
scale_shape_manual(values = c("Not Significant" = 16, "Significant" = 17)) + # ● = not sig, ▲ = sig
labs(
title = "JMML-based Metabolite Differences",
subtitle = "Fisher’s Exact Test on Missingness (ZHP Platform)",
caption = paste0("n = ", nrow(fish_plot_ZHP_fdr),
" metabolites tested using Fisher’s Exact Test"),
x = "log2(Odds Ratio: JMML vs. Control)\n↑ More detected in JMML, ↓ More detected in Control",
y = "-log10(FDR)",
color = "Relative Detection",
shape = "Statistically Significant\n(FDR < 0.05)"
)+
theme_bw(base_size = 14)
print(fish_p_ZHP_fdr)
ggsave(filename = "../res/annotated/fisher_volcano_ZHP_FDR.png", plot = fish_p_ZHP_fdr)
# ==== ZHP - TTest Visualizations FDR ==========================================
ttest_ZHP <- final_results %>%
filter(test_type == "TTest") %>%
filter(platform == "ZHP")
# ==== Volcano Plot of Differences ====
ttest_results_Combined_ZHP_fdr <- ttest_ZHP %>%
mutate(
direction = case_when(
log2_fc > 0 ~ "Higher in JMML",
log2_fc < 0 ~ "Higher in Control",
TRUE ~ "Ambiguous"
),
sig = ifelse(global_fdr < 0.05, "Significant", "Not Significant")
) %>% mutate(sig = factor(sig, levels = c("Not Significant", "Significant")))
ZHP_ttest_volcano_FDR <- ggplot(
ttest_results_Combined_ZHP_fdr,
aes(x = log2_fc, y = -log10(global_fdr))
) +
geom_point(aes(color = direction, shape = sig), alpha = 0.7, size = 1.5) +
geom_hline(yintercept = -log10(0.05), linetype = "dashed") +
geom_vline(xintercept = c(-1, 1), linetype = "dotted", color = "gray50") +
geom_text_repel(
data = ttest_results_Combined_ZHP_fdr %>% filter(sig == "Significant"),
aes(label = compound_name),
size = 2.5,
max.overlaps = 15,
box.padding = 0.3
) +
scale_color_manual(values = c(
"Higher in JMML" = "#E74C3C",
"Higher in Control" = "#3498DB",
"Ambiguous" = "gray50"
)) +
scale_shape_manual(values = c("Not Significant" = 16, "Significant" = 17)) + # ● = not sig, ▲ = sig
labs(
title = "JMML-based Metabolite Differences",
subtitle = "T-test on Imputed Box-Cox Transformed Intensities\n(ZHP Platform)",
caption = paste0("n = ", nrow(ttest_results_Combined_ZHP_fdr), " metabolites tested via two-sided t-test."),
x = "log2FC(JMML / Control)\n↑ Higher in JMML, ↓ Higher in Control",
y = "-log10(FDR)\nfrom t-test",
color = "Relative Intensity",
shape = "Statistically Significant\n(FDR < 0.05)"
) +
theme_bw(base_size = 14)
print(ZHP_ttest_volcano_FDR)
# Save volcano plot
ggsave("../res/annotated/ZHP_ttest_volcano_FDR.png",
ZHP_ttest_volcano_FDR, width = 10, height = 6, dpi = 300)
# ==== Top 5 Violin Plots of Most Significant Sex Differences ====
# Identify top 5 metabolites by raw p-value
top5_Combined_ZHP <- ttest_results_Combined_ZHP_fdr %>%
arrange(raw_p) %>%
slice_head(n = 5) %>%
pull(compound_name)
p_violin_ZHP <- ggplot(
ZHP_transformed %>% filter(compound_name %in% top5_Combined_ZHP),
aes(x = sample_group, y = intensity_log2, fill = sample_group)
) +
geom_violin(trim = FALSE, alpha = 0.7) +
geom_jitter(width = 0.2, size = 1, alpha = 0.5) +
facet_wrap(~ compound_name, scales = "free_y") +
scale_fill_manual(values = c("JMML" = "#E74C3C", "Control" = "#3498DB")) +
labs(
title = "Top 5 JMML-associated Metabolites",
subtitle = "T-test on Imputed Intensities (ZHP Platform)",
x = "Case Status",
y = "log2(Intensity)",
fill = ""
) +
theme_bw(base_size = 14) +
theme(
strip.text = element_text(size = 10, face = "bold"),
legend.position = "right"
)
p_violin_ZHP
# Save violin plot
ggsave(
filename = "../res/annotated/ZHP_top5_ttest.png",
plot = p_violin_ZHP,
width = 10, height = 6, units = "in", dpi = 300
)
# ==== Raw-p value Visualizations ==============================================
# ==== Viz Fish ZHP P-value ===================================================
fish_ZHP <- final_results %>%
filter(test_type == "Fisher") %>%
filter(platform == "ZHP")
# Add enrichment direction labels
fish_plot_ZHP_p <- fish_ZHP %>%
mutate(
direction = case_when(
is.na(log2_OR) ~ "Ambiguous",
log2_OR >= 1 ~ "Detected more in JMML",
log2_OR <= -1 ~ "Detected more in Control",
abs(log2_OR) < 1 ~ "Ambiguous"
),
neglog10_p = -log10(raw_p),
sig = ifelse(raw_p < 0.05, "Significant", "Not Significant")) %>%
mutate(sig = factor(sig, levels = c("Not Significant", "Significant")))
# Drop columns that are entirely NA
fish_plot_ZHP_p_clean <- fish_plot_ZHP_p %>%
select(where(~ !all(is.na(.)))) %>%
mutate(across(where(is.numeric), ~ round(., 2)))
write.csv(fish_plot_ZHP_p_clean, file = "../res/annotated/clean_fisher_ZHP_table.csv")
fish_p_ZHP <- ggplot(fish_plot_ZHP_p, aes(x = log2_OR, y = neglog10_p)) +
geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "gray") +
geom_vline(xintercept = c(-1, 1), linetype = "dotted", color = "gray") +
geom_point(aes(color = direction, shape = sig), alpha = 0.7) +
# geom_text_repel(data = fish_plot_ZHP_p %>% filter(sig == "Significant"),
# aes(label = compound_name),
# size = 3, max.overlaps = 15) +
scale_color_manual(values = c(
"Detected more in JMML" = "#E74C3C",
"Detected more in Control" = "#3498DB",
"Ambiguous" = "gray50"
)) +
scale_shape_manual(values = c("Not Significant" = 16, "Significant" = 17)) + # ● = not sig, ▲ = sig
labs(
title = "JMML-based Metabolite Differences",
subtitle = "Fisher’s Exact Test on Missingness (ZHP Platform)",
caption = paste0("n = ", nrow(fish_plot_ZHP_p ),
" metabolites tested using Fisher’s Exact Test"),
x = "log2(Odds Ratio: JMML vs. Control)\n↑ More detected in JMML, ↓ More detected in Control",
y = "-log10(P-value)",
color = "Relative Detection",
shape = "Statistically Significant\n(P-value < 0.05)"
)+
theme_bw(base_size = 14)
ggsave(filename = "../res/annotated/fisher_volcano_ZHP_pval.png", plot = fish_p_ZHP)
# ==== ZHP - TTest Visualizations FDR ==========================================
ttest_ZHP <- final_results %>%
filter(test_type == "TTest") %>%
filter(platform == "ZHP")
# ==== Volcano Plot of Sex Differences ====
ttest_results_Combined_ZHP_p <- ttest_ZHP %>%
mutate(
direction = case_when(
log2_fc > 0 ~ "Higher in JMML",
log2_fc < 0 ~ "Higher in Control",
TRUE ~ "Ambiguous"
),
sig = ifelse(raw_p < 0.05, "Significant", "Not Significant")
) %>% mutate(sig = factor(sig, levels = c("Not Significant", "Significant")))
# Drop columns that are entirely NA
ttest_results_Combined_ZHP_clean <- ttest_results_Combined_ZHP_p %>%
select(where(~ !all(is.na(.)))) %>%
mutate(across(where(is.numeric), ~ round(., 2)))
write.csv(ttest_results_Combined_ZHP_clean,
file = "../res/annotated/ttest_results_Combined_ZHP_clean.csv")
ZHP_ttest_volcano_pval <- ggplot(
ttest_results_Combined_ZHP_p ,
aes(x = log2_fc, y = -log10(raw_p))
) +
geom_point(aes(color = direction, shape = sig), alpha = 0.7, size = 1.5) +
geom_hline(yintercept = -log10(0.05), linetype = "dashed") +
geom_vline(xintercept = c(-1, 1), linetype = "dotted", color = "gray50") +
scale_color_manual(values = c(
"Higher in JMML" = "#E74C3C",
"Higher in Control" = "#3498DB",
"Ambiguous" = "gray50"
)) +
scale_shape_manual(values = c("Not Significant" = 16, "Significant" = 17)) + # ● = not sig, ▲ = sig
labs(
title = "JMML-based Metabolite Differences",
subtitle = "T-test on Imputed Box-Cox Transformed Intensities\n(ZHP Platform)",
caption = paste0("n = ", nrow(ttest_results_Combined_ZHP_p), " metabolites tested via two-sided t-test."),
x = "log2FC(JMML / Control)\n↑ Higher in JMML, ↓ Higher in Control",
y = "-log10(p-value)\nfrom t-test",
color = "Relative Intensity",
shape = "Statistically Significant\n(p-value < 0.05)"
) +
theme_bw(base_size = 14)
print(ZHP_ttest_volcano_pval)
# Save volcano plot
ggsave("../res/annotated/ZHP_ttest_volcano_pval.png",
ZHP_ttest_volcano_pval, width = 10, height = 6, dpi = 300)
# ==== RPNPF Visualizations ======================================================
# ==== FDR Visualizations ======================================================
# ==== Viz Fish FDR ============================================================
fish_RPNPF <- final_results %>%
filter(test_type == "Fisher") %>%
filter(platform == "RPNPF")
# Add enrichment direction labels
fish_plot_RPNPF_fdr <- fish_RPNPF %>%
mutate(
direction = case_when(
is.na(log2_OR) ~ "Ambiguous",
log2_OR >= 1 ~ "Detected more in JMML",
log2_OR <= -1 ~ "Detected more in Control",
abs(log2_OR) < 1 ~ "Ambiguous"
),
neglog10_fdr = -log10(global_fdr),
sig = ifelse(global_fdr < 0.05, "Significant", "Not Significant")) %>%
mutate(sig = factor(sig, levels = c("Not Significant", "Significant")))
fish_p_RPNPF_fdr <- ggplot(fish_plot_RPNPF_fdr, aes(x = log2_OR, y = neglog10_fdr)) +
geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "gray") +
geom_vline(xintercept = c(-1, 1), linetype = "dotted", color = "gray") +
geom_point(aes(color = direction, shape = sig), alpha = 0.7) +
geom_text_repel(data = fish_plot_RPNPF_fdr %>% filter(sig == "Significant"),
aes(label = compound_name),
size = 3, max.overlaps = 15) +
scale_color_manual(values = c(
"Detected more in JMML" = "#E74C3C",
"Detected more in Control" = "#3498DB",
"Ambiguous" = "gray50"
)) +
scale_shape_manual(values = c("Not Significant" = 16, "Significant" = 17)) + # ● = not sig, ▲ = sig
labs(
title = "JMML-based Metabolite Differences",
subtitle = "Fisher’s Exact Test on Missingness (RPNPF Platform)",
caption = paste0("n = ", nrow(fish_plot_RPNPF_fdr),
" metabolites tested using Fisher’s Exact Test"),
x = "log2(Odds Ratio: JMML vs. Control)\n↑ More detected in JMML, ↓ More detected in Control",
y = "-log10(FDR)",
color = "Relative Detection",
shape = "Statistically Significant\n(FDR < 0.05)"
)+
theme_bw(base_size = 14)
print(fish_p_RPNPF_fdr)
ggsave(filename = "../res/annotated/fisher_volcano_RPNPF_FDR.png", plot = fish_p_RPNPF_fdr)
# ==== RPNPF - TTest Visualizations FDR ==========================================
ttest_RPNPF <- final_results %>%
filter(test_type == "TTest") %>%
filter(platform == "RPNPF")
# ==== Volcano Plot of Sex Differences ====
ttest_results_Combined_RPNPF_fdr <- ttest_RPNPF %>%
mutate(
direction = case_when(
log2_fc > 0 ~ "Higher in JMML",
log2_fc < 0 ~ "Higher in Control",
TRUE ~ "Ambiguous"
),
sig = ifelse(global_fdr < 0.05, "Significant", "Not Significant")
) %>% mutate(sig = factor(sig, levels = c("Not Significant", "Significant")))
RPNPF_ttest_volcano_FDR <- ggplot(
ttest_results_Combined_RPNPF_fdr,
aes(x = log2_fc, y = -log10(global_fdr))
) +
geom_point(aes(color = direction, shape = sig), alpha = 0.7, size = 1.5) +
geom_hline(yintercept = -log10(0.05), linetype = "dashed") +
geom_vline(xintercept = c(-1, 1), linetype = "dotted", color = "gray50") +
geom_text_repel(
data = ttest_results_Combined_RPNPF_fdr %>% filter(sig == "Significant"),
aes(label = compound_name),
size = 2.5,
max.overlaps = 15,
box.padding = 0.3
) +
scale_color_manual(values = c(
"Higher in JMML" = "#E74C3C",
"Higher in Control" = "#3498DB",
"Ambiguous" = "gray50"
)) +
scale_shape_manual(values = c("Not Significant" = 16, "Significant" = 17)) + # ● = not sig, ▲ = sig
labs(
title = "JMML-based Metabolite Differences",
subtitle = "T-test on Imputed Box-Cox Transformed Intensities\n(RPNPF Platform)",
caption = paste0("n = ", nrow(ttest_results_Combined_RPNPF_fdr), " metabolites tested via two-sided t-test."),
x = "log2FC(JMML / Control)\n↑ Higher in JMML, ↓ Higher in Control",
y = "-log10(FDR)\nfrom t-test",
color = "Relative Intensity",
shape = "Statistically Significant\n(FDR < 0.05)"
) +
theme_bw(base_size = 14)
print(RPNPF_ttest_volcano_FDR)
# Save volcano plot
ggsave("../res/annotated/RPNPF_ttest_volcano_FDR.png",
RPNPF_ttest_volcano_FDR, width = 10, height = 6, dpi = 300)
# ==== Top 5 Violin Plots of Most Significant Sex Differences ====
# Identify top 5 metabolites by raw p-value
top5_Combined_RPNPF <- ttest_results_Combined_RPNPF_fdr %>%
arrange(raw_p) %>%
slice_head(n = 5) %>%
pull(compound_name)
p_violin_RPNPF <- ggplot(
RPNPF_transformed %>% filter(compound_name %in% top5_Combined_RPNPF),
aes(x = sample_group, y = intensity_log2, fill = sample_group)
) +
geom_violin(trim = FALSE, alpha = 0.7) +
geom_jitter(width = 0.2, size = 1, alpha = 0.5) +
facet_wrap(~ compound_name, scales = "free_y") +
scale_fill_manual(values = c("JMML" = "#E74C3C", "Control" = "#3498DB")) +
labs(
title = "Top 5 JMML-associated Metabolites",
subtitle = "T-test on Imputed Intensities (RPNPF Platform)",
x = "Case Status",
y = "log2(Intensity)",
fill = ""
) +
theme_bw(base_size = 14) +
theme(
strip.text = element_text(size = 10, face = "bold"),
legend.position = "right"
)
p_violin_RPNPF
# Save violin plot
ggsave(
filename = "../res/annotated/RPNPF_top5_ttest.png",
plot = p_violin_RPNPF,
width = 10, height = 6, units = "in", dpi = 300
)
# ==== Raw-p value Visualizations ==============================================
# ==== Viz Fish RPNPF P-value ===================================================
fish_RPNPF <- final_results %>%
filter(test_type == "Fisher") %>%
filter(platform == "RPNPF")
# Add enrichment direction labels
fish_plot_RPNPF_p <- fish_RPNPF %>%
mutate(
direction = case_when(
is.na(log2_OR) ~ "Ambiguous",
log2_OR >= 1 ~ "Detected more in JMML",
log2_OR <= -1 ~ "Detected more in Control",
abs(log2_OR) < 1 ~ "Ambiguous"
),
neglog10_p = -log10(raw_p),
sig = ifelse(raw_p < 0.05, "Significant", "Not Significant")) %>%
mutate(sig = factor(sig, levels = c("Not Significant", "Significant")))
# Drop columns that are entirely NA
fish_plot_RPNPF_p_clean <- fish_plot_RPNPF_p %>%
select(where(~ !all(is.na(.)))) %>%
mutate(across(where(is.numeric), ~ round(., 2)))
write.csv(fish_plot_RPNPF_p_clean, file = "../res/annotated/clean_fisher_RPNPF_table.csv")
fish_p_RPNPF <- ggplot(fish_plot_RPNPF_p, aes(x = log2_OR, y = neglog10_p)) +
geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "gray") +
geom_vline(xintercept = c(-1, 1), linetype = "dotted", color = "gray") +
geom_point(aes(color = direction, shape = sig), alpha = 0.7) +
# geom_text_repel(data = fish_plot_RPNPF_p %>% filter(sig == "Significant"),
# aes(label = compound_name),
# size = 3, max.overlaps = 15) +
scale_color_manual(values = c(
"Detected more in JMML" = "#E74C3C",
"Detected more in Control" = "#3498DB",
"Ambiguous" = "gray50"
)) +
scale_shape_manual(values = c("Not Significant" = 16, "Significant" = 17)) + # ● = not sig, ▲ = sig
labs(
title = "JMML-based Metabolite Differences",
subtitle = "Fisher’s Exact Test on Missingness (RPNPF Platform)",
caption = paste0("n = ", nrow(fish_plot_RPNPF_p ),
" metabolites tested using Fisher’s Exact Test"),
x = "log2(Odds Ratio: JMML vs. Control)\n↑ More detected in JMML, ↓ More detected in Control",
y = "-log10(P-value)",
color = "Relative Detection",
shape = "Statistically Significant\n(P-value < 0.05)"
)+
theme_bw(base_size = 14)
print(fish_p_RPNPF)
ggsave(filename = "../res/annotated/fisher_volcano_RPNPF_pval.png", plot = fish_p_RPNPF)
# ==== RPNPF - TTest Visualizations FDR ==========================================
ttest_RPNPF <- final_results %>%
filter(test_type == "TTest") %>%
filter(platform == "RPNPF")
# ==== Volcano Plot of Sex Differences ====
ttest_results_Combined_RPNPF_p <- ttest_RPNPF %>%
mutate(
direction = case_when(
log2_fc > 0 ~ "Higher in JMML",
log2_fc < 0 ~ "Higher in Control",
TRUE ~ "Ambiguous"
),
sig = ifelse(raw_p < 0.05, "Significant", "Not Significant")
) %>% mutate(sig = factor(sig, levels = c("Not Significant", "Significant")))
RPNPF_ttest_volcano_pval <- ggplot(
ttest_results_Combined_RPNPF_p ,
aes(x = log2_fc, y = -log10(raw_p))
) +
geom_point(aes(color = direction, shape = sig), alpha = 0.7, size = 1.5) +
geom_hline(yintercept = -log10(0.05), linetype = "dashed") +
geom_vline(xintercept = c(-1, 1), linetype = "dotted", color = "gray50") +
scale_color_manual(values = c(
"Higher in JMML" = "#E74C3C",
"Higher in Control" = "#3498DB",
"Ambiguous" = "gray50"
)) +
scale_shape_manual(values = c("Not Significant" = 16, "Significant" = 17)) + # ● = not sig, ▲ = sig
labs(
title = "JMML-based Metabolite Differences",
subtitle = "T-test on Imputed Box-Cox Transformed Intensities\n(RPNPF Platform)",
caption = paste0("n = ", nrow(ttest_results_Combined_RPNPF_p), " metabolites tested via two-sided t-test."),
x = "log2FC(JMML / Control)\n↑ Higher in JMML, ↓ Higher in Control",
y = "-log10(p-value)\nfrom t-test",
color = "Relative Intensity",
shape = "Statistically Significant\n(p-value < 0.05)"
) +
theme_bw(base_size = 14)
print(RPNPF_ttest_volcano_pval)
# Save volcano plot
ggsave("../res/annotated/RPNPF_ttest_volcano_pval.png",
RPNPF_ttest_volcano_pval, width = 10, height = 6, dpi = 300)
# ==== Significant Hits ========================================================
# ==== Viz ttest sig hit FDR 2-hydro ===========================================
t_res <- final_results %>% filter(test_type == "TTest") %>% filter(global_fdr < .05)
ZHP_transformed <- readRDS("../data/R_objects/ZHP_imputed.rds")
ttest_sig <- ggplot(ZHP_transformed %>% filter(compound_name == "2-hydroxychrysene"),
aes(x = sample_group, y = intensity_log2, fill = sample_group)) +
geom_violin(trim = FALSE, scale = "width") +
geom_jitter(width = 0.15, size = 1.5, alpha = 0.6, shape = 21, stroke = 0.2) +
stat_summary(fun = mean, geom = "point", shape = 21, size = 2, fill = "white") +
scale_fill_manual(values = c("Control" = "#3498DB", "JMML" = "#E74C3C")) +
labs(
title = "",
x = "",
y = "log2(Intensity)",
fill = ""
) +
theme_minimal(base_size = 14) +
theme(
legend.position = "bottom",
axis.text.x = element_text(size = 12),
axis.text.y = element_text(size = 12),
plot.title = element_text(size = 16, face = "bold")
)
print(ttest_sig)
#Distribution of log2(intensity) of 2-hydroxychrysene
ggsave(filename = "../res/annotated/ttest_violin_2-hydroxychrysene.png", plot = ttest_sig)
# ==== Viz violin sig hit Fisher ===========================================
fish_res_top <- final_results %>%
filter(test_type == "Fisher") %>%
filter(global_fdr < .05) %>%
pull(compound_name)
# Load the non-imputed intensities (M2)
ZHP_filtered_fish <- readRDS("../data/R_objects/ZHP_annotated_filtered.rds") %>%
filter(sample_group %in% c("JMML", "Control")) %>%
filter(compound_name %in% fish_res_top) %>%
mutate(intensity_log2 = log2(intensity + 1))
writexl::write_xlsx(ZHP_filtered_fish, path = "../res/annotated/Fisher_FDR_hits.xlsx")
fish_res_violin_ZHP <- ggplot(
ZHP_filtered_fish,
aes(x = sample_group, y = intensity_log2, fill = sample_group)
) +
geom_violin(trim = FALSE, alpha = 0.7) +
geom_jitter(width = 0.2, size = 1, alpha = 0.5) +
facet_wrap(~ compound_name, scales = "free_y") +
scale_fill_manual(values = c("JMML" = "#E74C3C", "Control" = "#3498DB")) +
labs(
title = "Top 4 JMML-associated Metabolites ",
subtitle = "Fisher's Exact Test on Missingness (ZHP Platform)",
x = "Case Status",
y = "log2(Intensity)",
fill = ""
) +
theme_bw(base_size = 14) +
theme(
strip.text = element_text(size = 10, face = "bold"),
legend.position = "right"
)
fish_res_violin_ZHP
# Save violin plot
ggsave(
filename = "../res/annotated/ZHP_top4violin_fish_res.png",
plot = fish_res_violin_ZHP,
width = 10, height = 6, units = "in", dpi = 300
)
# ==== Lilly Poster ==========================================
set.seed(42)
# build the plot
fish_p_ZHP_fdr_pub <- ggplot(fish_plot_ZHP_fdr, aes(x = log2_OR, y = neglog10_fdr)) +
geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "gray80") +
geom_vline(xintercept = c(-1, 1), linetype = "dotted", color = "gray80") +
geom_point(aes(color = direction, shape = sig), alpha = 0.7) +
geom_text_repel(
data = filter(fish_plot_ZHP_fdr, sig == "Significant"),
aes(label = compound_name),
size = 3,
box.padding = 0.6,
point.padding = 0.4,
force = 2,
max.overlaps = Inf
) +
scale_color_manual(values = c(
"Detected more in JMML" = "#E74C3C",
"Detected more in Control" = "#3498DB",
"Ambiguous" = "gray50"
)) +
scale_shape_manual(values = c("Not Significant" = 16, "Significant" = 17)) +
labs(
x = expression(Log[2](OR)~"(JMML/Control)"),
y = expression(-Log[10](FDR)),
color = "Relative Detection",
shape = "Statistically Significant\n(FDR < 0.05)"
) +
theme_classic(base_size = 14, base_family = "Times") +
theme(
axis.title.x = element_text(face = "bold",
size = 14,
margin = ggplot2::margin(t = 8)),
axis.title.y = element_text(face = "bold",
size = 14,
margin = ggplot2::margin(r = 8)),
axis.text = element_text(size = 12),
legend.key.size = unit(0.8, "lines"),
legend.position = "bottom",
legend.box = "vertical",
legend.title = element_text(face = "bold"),
legend.text = element_text(size = 11),
legend.margin = margin(t = 5)
) +
guides(
color = guide_legend(title.position = "top", nrow = 1),
shape = guide_legend(title.position = "top", nrow = 1)
)
#print(fish_p_ZHP_fdr_pub)
ggsave(filename = "../res/annotated/fisher_volcano_ZHP_FDR_pub.png", plot = fish_p_ZHP_fdr_pub)Table Legend
compound_name: Name of the metabolite
effect: T-statistic from the two-sample t-test comparing JMML vs. Control
raw_p: Unadjusted p-value from the t-test
mean_Control: Average log₂-transformed intensity in Control samples
mean_JMML: Average log₂-transformed intensity in JMML samples
log2_fc: Log₂ fold change (JMML – Control); positive values = higher in JMML
platform: Analytical platform (ZHP or RPNPF)
test_type: Type of statistical test (TTest for abundance comparisons)
global_fdr: Benjamini-Hochberg adjusted p-value
direction: Interpretation of the fold change (e.g., "Detected more in JMML")
sig: Significance label after FDR correction (e.g., “Significant”, “Not Significant”)
For each metabolite, we test the null hypothesis that there is no meaningful difference in intensity between JMML and Control groups.
Because we are testing hundreds of metabolites across two platforms using both t-tests and Fisher’s exact tests, the risk of false positives increases with the number of comparisons. To address this, we apply the Benjamini-Hochberg procedure to adjust raw p-values and control the False Discovery Rate (FDR) - the expected proportion of false positives among the results we call statistically significant.
This helps ensure that our list of significant findings is not dominated by random noise from multiple testing.
## Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size):
## semi-transparency is not supported on this device: reported only once per page
## Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size):
## semi-transparency is not supported on this device: reported only once per page
## Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size):
## semi-transparency is not supported on this device: reported only once per page
## Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size):
## semi-transparency is not supported on this device: reported only once per page
## Warning in grid.Call.graphics(C_polygon, x$x, x$y, index): semi-transparency is
## not supported on this device: reported only once per page
## Warning in grid.Call.graphics(C_polygon, x$x, x$y, index): semi-transparency is
## not supported on this device: reported only once per page
## Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size):
## semi-transparency is not supported on this device: reported only once per page
We will use the table of significant hits (significance level of 0.05) from the unadjusted p-values to perform pathway analysis:
library(tidyverse)
library(readxl)
library(ggrepel)
# ==== Load data ===============================================================
#source("03_QLIC_imputation.R")
#assumes the objects in 03_QLIC are in the environment
final_results <- read_xlsx("../res/annotated/final_results_global_fdr.xlsx")
# ==== ZHP Visualizations ======================================================
# ==== FDR Visualizations ======================================================
# ==== Viz Fish FDR ============================================================
fish_ZHP <- final_results %>%
filter(test_type == "Fisher") %>%
filter(platform == "ZHP")
# Add enrichment direction labels
fish_plot_ZHP_fdr <- fish_ZHP %>%
mutate(
direction = case_when(
is.na(log2_OR) ~ "Ambiguous",
log2_OR >= 1 ~ "Detected more in JMML",
log2_OR <= -1 ~ "Detected more in Control",
abs(log2_OR) < 1 ~ "Ambiguous"
),
neglog10_fdr = -log10(global_fdr),
sig = ifelse(global_fdr < 0.05, "Significant", "Not Significant")) %>%
mutate(sig = factor(sig, levels = c("Not Significant", "Significant")))
fish_p_ZHP_fdr <- ggplot(fish_plot_ZHP_fdr, aes(x = log2_OR, y = neglog10_fdr)) +
geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "gray") +
geom_vline(xintercept = c(-1, 1), linetype = "dotted", color = "gray") +
geom_point(aes(color = direction, shape = sig), alpha = 0.7) +
geom_text_repel(data = fish_plot_ZHP_fdr %>% filter(sig == "Significant"),
aes(label = compound_name),
size = 3, max.overlaps = 15) +
scale_color_manual(values = c(
"Detected more in JMML" = "#E74C3C",
"Detected more in Control" = "#3498DB",
"Ambiguous" = "gray50"
)) +
scale_shape_manual(values = c("Not Significant" = 16, "Significant" = 17)) + # ● = not sig, ▲ = sig
labs(
title = "JMML-based Metabolite Differences",
subtitle = "Fisher’s Exact Test on Missingness (ZHP Platform)",
caption = paste0("n = ", nrow(fish_plot_ZHP_fdr),
" metabolites tested using Fisher’s Exact Test"),
x = "log2(Odds Ratio: JMML vs. Control)\n↑ More detected in JMML, ↓ More detected in Control",
y = "-log10(FDR)",
color = "Relative Detection",
shape = "Statistically Significant\n(FDR < 0.05)"
)+
theme_bw(base_size = 14)
print(fish_p_ZHP_fdr)
ggsave(filename = "../res/annotated/fisher_volcano_ZHP_FDR.png", plot = fish_p_ZHP_fdr)
# ==== ZHP - TTest Visualizations FDR ==========================================
ttest_ZHP <- final_results %>%
filter(test_type == "TTest") %>%
filter(platform == "ZHP")
# ==== Volcano Plot of Differences ====
ttest_results_Combined_ZHP_fdr <- ttest_ZHP %>%
mutate(
direction = case_when(
log2_fc > 0 ~ "Higher in JMML",
log2_fc < 0 ~ "Higher in Control",
TRUE ~ "Ambiguous"
),
sig = ifelse(global_fdr < 0.05, "Significant", "Not Significant")
) %>% mutate(sig = factor(sig, levels = c("Not Significant", "Significant")))
ZHP_ttest_volcano_FDR <- ggplot(
ttest_results_Combined_ZHP_fdr,
aes(x = log2_fc, y = -log10(global_fdr))
) +
geom_point(aes(color = direction, shape = sig), alpha = 0.7, size = 1.5) +
geom_hline(yintercept = -log10(0.05), linetype = "dashed") +
geom_vline(xintercept = c(-1, 1), linetype = "dotted", color = "gray50") +
geom_text_repel(
data = ttest_results_Combined_ZHP_fdr %>% filter(sig == "Significant"),
aes(label = compound_name),
size = 2.5,
max.overlaps = 15,
box.padding = 0.3
) +
scale_color_manual(values = c(
"Higher in JMML" = "#E74C3C",
"Higher in Control" = "#3498DB",
"Ambiguous" = "gray50"
)) +
scale_shape_manual(values = c("Not Significant" = 16, "Significant" = 17)) + # ● = not sig, ▲ = sig
labs(
title = "JMML-based Metabolite Differences",
subtitle = "T-test on Imputed Box-Cox Transformed Intensities\n(ZHP Platform)",
caption = paste0("n = ", nrow(ttest_results_Combined_ZHP_fdr), " metabolites tested via two-sided t-test."),
x = "log2FC(JMML / Control)\n↑ Higher in JMML, ↓ Higher in Control",
y = "-log10(FDR)\nfrom t-test",
color = "Relative Intensity",
shape = "Statistically Significant\n(FDR < 0.05)"
) +
theme_bw(base_size = 14)
print(ZHP_ttest_volcano_FDR)
# Save volcano plot
ggsave("../res/annotated/ZHP_ttest_volcano_FDR.png",
ZHP_ttest_volcano_FDR, width = 10, height = 6, dpi = 300)
# ==== Top 5 Violin Plots of Most Significant Sex Differences ====
# Identify top 5 metabolites by raw p-value
top5_Combined_ZHP <- ttest_results_Combined_ZHP_fdr %>%
arrange(raw_p) %>%
slice_head(n = 5) %>%
pull(compound_name)
p_violin_ZHP <- ggplot(
ZHP_transformed %>% filter(compound_name %in% top5_Combined_ZHP),
aes(x = sample_group, y = intensity_log2, fill = sample_group)
) +
geom_violin(trim = FALSE, alpha = 0.7) +
geom_jitter(width = 0.2, size = 1, alpha = 0.5) +
facet_wrap(~ compound_name, scales = "free_y") +
scale_fill_manual(values = c("JMML" = "#E74C3C", "Control" = "#3498DB")) +
labs(
title = "Top 5 JMML-associated Metabolites",
subtitle = "T-test on Imputed Intensities (ZHP Platform)",
x = "Case Status",
y = "log2(Intensity)",
fill = ""
) +
theme_bw(base_size = 14) +
theme(
strip.text = element_text(size = 10, face = "bold"),
legend.position = "right"
)
p_violin_ZHP
# Save violin plot
ggsave(
filename = "../res/annotated/ZHP_top5_ttest.png",
plot = p_violin_ZHP,
width = 10, height = 6, units = "in", dpi = 300
)
# ==== Raw-p value Visualizations ==============================================
# ==== Viz Fish ZHP P-value ===================================================
fish_ZHP <- final_results %>%
filter(test_type == "Fisher") %>%
filter(platform == "ZHP")
# Add enrichment direction labels
fish_plot_ZHP_p <- fish_ZHP %>%
mutate(
direction = case_when(
is.na(log2_OR) ~ "Ambiguous",
log2_OR >= 1 ~ "Detected more in JMML",
log2_OR <= -1 ~ "Detected more in Control",
abs(log2_OR) < 1 ~ "Ambiguous"
),
neglog10_p = -log10(raw_p),
sig = ifelse(raw_p < 0.05, "Significant", "Not Significant")) %>%
mutate(sig = factor(sig, levels = c("Not Significant", "Significant")))
# Drop columns that are entirely NA
fish_plot_ZHP_p_clean <- fish_plot_ZHP_p %>%
select(where(~ !all(is.na(.)))) %>%
mutate(across(where(is.numeric), ~ round(., 2)))
write.csv(fish_plot_ZHP_p_clean, file = "../res/annotated/clean_fisher_ZHP_table.csv")
fish_p_ZHP <- ggplot(fish_plot_ZHP_p, aes(x = log2_OR, y = neglog10_p)) +
geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "gray") +
geom_vline(xintercept = c(-1, 1), linetype = "dotted", color = "gray") +
geom_point(aes(color = direction, shape = sig), alpha = 0.7) +
# geom_text_repel(data = fish_plot_ZHP_p %>% filter(sig == "Significant"),
# aes(label = compound_name),
# size = 3, max.overlaps = 15) +
scale_color_manual(values = c(
"Detected more in JMML" = "#E74C3C",
"Detected more in Control" = "#3498DB",
"Ambiguous" = "gray50"
)) +
scale_shape_manual(values = c("Not Significant" = 16, "Significant" = 17)) + # ● = not sig, ▲ = sig
labs(
title = "JMML-based Metabolite Differences",
subtitle = "Fisher’s Exact Test on Missingness (ZHP Platform)",
caption = paste0("n = ", nrow(fish_plot_ZHP_p ),
" metabolites tested using Fisher’s Exact Test"),
x = "log2(Odds Ratio: JMML vs. Control)\n↑ More detected in JMML, ↓ More detected in Control",
y = "-log10(P-value)",
color = "Relative Detection",
shape = "Statistically Significant\n(P-value < 0.05)"
)+
theme_bw(base_size = 14)
ggsave(filename = "../res/annotated/fisher_volcano_ZHP_pval.png", plot = fish_p_ZHP)
# ==== ZHP - TTest Visualizations FDR ==========================================
ttest_ZHP <- final_results %>%
filter(test_type == "TTest") %>%
filter(platform == "ZHP")
# ==== Volcano Plot of Sex Differences ====
ttest_results_Combined_ZHP_p <- ttest_ZHP %>%
mutate(
direction = case_when(
log2_fc > 0 ~ "Higher in JMML",
log2_fc < 0 ~ "Higher in Control",
TRUE ~ "Ambiguous"
),
sig = ifelse(raw_p < 0.05, "Significant", "Not Significant")
) %>% mutate(sig = factor(sig, levels = c("Not Significant", "Significant")))
# Drop columns that are entirely NA
ttest_results_Combined_ZHP_clean <- ttest_results_Combined_ZHP_p %>%
select(where(~ !all(is.na(.)))) %>%
mutate(across(where(is.numeric), ~ round(., 2)))
write.csv(ttest_results_Combined_ZHP_clean,
file = "../res/annotated/ttest_results_Combined_ZHP_clean.csv")
ZHP_ttest_volcano_pval <- ggplot(
ttest_results_Combined_ZHP_p ,
aes(x = log2_fc, y = -log10(raw_p))
) +
geom_point(aes(color = direction, shape = sig), alpha = 0.7, size = 1.5) +
geom_hline(yintercept = -log10(0.05), linetype = "dashed") +
geom_vline(xintercept = c(-1, 1), linetype = "dotted", color = "gray50") +
scale_color_manual(values = c(
"Higher in JMML" = "#E74C3C",
"Higher in Control" = "#3498DB",
"Ambiguous" = "gray50"
)) +
scale_shape_manual(values = c("Not Significant" = 16, "Significant" = 17)) + # ● = not sig, ▲ = sig
labs(
title = "JMML-based Metabolite Differences",
subtitle = "T-test on Imputed Box-Cox Transformed Intensities\n(ZHP Platform)",
caption = paste0("n = ", nrow(ttest_results_Combined_ZHP_p), " metabolites tested via two-sided t-test."),
x = "log2FC(JMML / Control)\n↑ Higher in JMML, ↓ Higher in Control",
y = "-log10(p-value)\nfrom t-test",
color = "Relative Intensity",
shape = "Statistically Significant\n(p-value < 0.05)"
) +
theme_bw(base_size = 14)
print(ZHP_ttest_volcano_pval)
# Save volcano plot
ggsave("../res/annotated/ZHP_ttest_volcano_pval.png",
ZHP_ttest_volcano_pval, width = 10, height = 6, dpi = 300)
# ==== RPNPF Visualizations ======================================================
# ==== FDR Visualizations ======================================================
# ==== Viz Fish FDR ============================================================
fish_RPNPF <- final_results %>%
filter(test_type == "Fisher") %>%
filter(platform == "RPNPF")
# Add enrichment direction labels
fish_plot_RPNPF_fdr <- fish_RPNPF %>%
mutate(
direction = case_when(
is.na(log2_OR) ~ "Ambiguous",
log2_OR >= 1 ~ "Detected more in JMML",
log2_OR <= -1 ~ "Detected more in Control",
abs(log2_OR) < 1 ~ "Ambiguous"
),
neglog10_fdr = -log10(global_fdr),
sig = ifelse(global_fdr < 0.05, "Significant", "Not Significant")) %>%
mutate(sig = factor(sig, levels = c("Not Significant", "Significant")))
fish_p_RPNPF_fdr <- ggplot(fish_plot_RPNPF_fdr, aes(x = log2_OR, y = neglog10_fdr)) +
geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "gray") +
geom_vline(xintercept = c(-1, 1), linetype = "dotted", color = "gray") +
geom_point(aes(color = direction, shape = sig), alpha = 0.7) +
geom_text_repel(data = fish_plot_RPNPF_fdr %>% filter(sig == "Significant"),
aes(label = compound_name),
size = 3, max.overlaps = 15) +
scale_color_manual(values = c(
"Detected more in JMML" = "#E74C3C",
"Detected more in Control" = "#3498DB",
"Ambiguous" = "gray50"
)) +
scale_shape_manual(values = c("Not Significant" = 16, "Significant" = 17)) + # ● = not sig, ▲ = sig
labs(
title = "JMML-based Metabolite Differences",
subtitle = "Fisher’s Exact Test on Missingness (RPNPF Platform)",
caption = paste0("n = ", nrow(fish_plot_RPNPF_fdr),
" metabolites tested using Fisher’s Exact Test"),
x = "log2(Odds Ratio: JMML vs. Control)\n↑ More detected in JMML, ↓ More detected in Control",
y = "-log10(FDR)",
color = "Relative Detection",
shape = "Statistically Significant\n(FDR < 0.05)"
)+
theme_bw(base_size = 14)
print(fish_p_RPNPF_fdr)
ggsave(filename = "../res/annotated/fisher_volcano_RPNPF_FDR.png", plot = fish_p_RPNPF_fdr)
# ==== RPNPF - TTest Visualizations FDR ==========================================
ttest_RPNPF <- final_results %>%
filter(test_type == "TTest") %>%
filter(platform == "RPNPF")
# ==== Volcano Plot of Sex Differences ====
ttest_results_Combined_RPNPF_fdr <- ttest_RPNPF %>%
mutate(
direction = case_when(
log2_fc > 0 ~ "Higher in JMML",
log2_fc < 0 ~ "Higher in Control",
TRUE ~ "Ambiguous"
),
sig = ifelse(global_fdr < 0.05, "Significant", "Not Significant")
) %>% mutate(sig = factor(sig, levels = c("Not Significant", "Significant")))
RPNPF_ttest_volcano_FDR <- ggplot(
ttest_results_Combined_RPNPF_fdr,
aes(x = log2_fc, y = -log10(global_fdr))
) +
geom_point(aes(color = direction, shape = sig), alpha = 0.7, size = 1.5) +
geom_hline(yintercept = -log10(0.05), linetype = "dashed") +
geom_vline(xintercept = c(-1, 1), linetype = "dotted", color = "gray50") +
geom_text_repel(
data = ttest_results_Combined_RPNPF_fdr %>% filter(sig == "Significant"),
aes(label = compound_name),
size = 2.5,
max.overlaps = 15,
box.padding = 0.3
) +
scale_color_manual(values = c(
"Higher in JMML" = "#E74C3C",
"Higher in Control" = "#3498DB",
"Ambiguous" = "gray50"
)) +
scale_shape_manual(values = c("Not Significant" = 16, "Significant" = 17)) + # ● = not sig, ▲ = sig
labs(
title = "JMML-based Metabolite Differences",
subtitle = "T-test on Imputed Box-Cox Transformed Intensities\n(RPNPF Platform)",
caption = paste0("n = ", nrow(ttest_results_Combined_RPNPF_fdr), " metabolites tested via two-sided t-test."),
x = "log2FC(JMML / Control)\n↑ Higher in JMML, ↓ Higher in Control",
y = "-log10(FDR)\nfrom t-test",
color = "Relative Intensity",
shape = "Statistically Significant\n(FDR < 0.05)"
) +
theme_bw(base_size = 14)
print(RPNPF_ttest_volcano_FDR)
# Save volcano plot
ggsave("../res/annotated/RPNPF_ttest_volcano_FDR.png",
RPNPF_ttest_volcano_FDR, width = 10, height = 6, dpi = 300)
# ==== Top 5 Violin Plots of Most Significant Sex Differences ====
# Identify top 5 metabolites by raw p-value
top5_Combined_RPNPF <- ttest_results_Combined_RPNPF_fdr %>%
arrange(raw_p) %>%
slice_head(n = 5) %>%
pull(compound_name)
p_violin_RPNPF <- ggplot(
RPNPF_transformed %>% filter(compound_name %in% top5_Combined_RPNPF),
aes(x = sample_group, y = intensity_log2, fill = sample_group)
) +
geom_violin(trim = FALSE, alpha = 0.7) +
geom_jitter(width = 0.2, size = 1, alpha = 0.5) +
facet_wrap(~ compound_name, scales = "free_y") +
scale_fill_manual(values = c("JMML" = "#E74C3C", "Control" = "#3498DB")) +
labs(
title = "Top 5 JMML-associated Metabolites",
subtitle = "T-test on Imputed Intensities (RPNPF Platform)",
x = "Case Status",
y = "log2(Intensity)",
fill = ""
) +
theme_bw(base_size = 14) +
theme(
strip.text = element_text(size = 10, face = "bold"),
legend.position = "right"
)
p_violin_RPNPF
# Save violin plot
ggsave(
filename = "../res/annotated/RPNPF_top5_ttest.png",
plot = p_violin_RPNPF,
width = 10, height = 6, units = "in", dpi = 300
)
# ==== Raw-p value Visualizations ==============================================
# ==== Viz Fish RPNPF P-value ===================================================
fish_RPNPF <- final_results %>%
filter(test_type == "Fisher") %>%
filter(platform == "RPNPF")
# Add enrichment direction labels
fish_plot_RPNPF_p <- fish_RPNPF %>%
mutate(
direction = case_when(
is.na(log2_OR) ~ "Ambiguous",
log2_OR >= 1 ~ "Detected more in JMML",
log2_OR <= -1 ~ "Detected more in Control",
abs(log2_OR) < 1 ~ "Ambiguous"
),
neglog10_p = -log10(raw_p),
sig = ifelse(raw_p < 0.05, "Significant", "Not Significant")) %>%
mutate(sig = factor(sig, levels = c("Not Significant", "Significant")))
# Drop columns that are entirely NA
fish_plot_RPNPF_p_clean <- fish_plot_RPNPF_p %>%
select(where(~ !all(is.na(.)))) %>%
mutate(across(where(is.numeric), ~ round(., 2)))
write.csv(fish_plot_RPNPF_p_clean, file = "../res/annotated/clean_fisher_RPNPF_table.csv")
fish_p_RPNPF <- ggplot(fish_plot_RPNPF_p, aes(x = log2_OR, y = neglog10_p)) +
geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "gray") +
geom_vline(xintercept = c(-1, 1), linetype = "dotted", color = "gray") +
geom_point(aes(color = direction, shape = sig), alpha = 0.7) +
# geom_text_repel(data = fish_plot_RPNPF_p %>% filter(sig == "Significant"),
# aes(label = compound_name),
# size = 3, max.overlaps = 15) +
scale_color_manual(values = c(
"Detected more in JMML" = "#E74C3C",
"Detected more in Control" = "#3498DB",
"Ambiguous" = "gray50"
)) +
scale_shape_manual(values = c("Not Significant" = 16, "Significant" = 17)) + # ● = not sig, ▲ = sig
labs(
title = "JMML-based Metabolite Differences",
subtitle = "Fisher’s Exact Test on Missingness (RPNPF Platform)",
caption = paste0("n = ", nrow(fish_plot_RPNPF_p ),
" metabolites tested using Fisher’s Exact Test"),
x = "log2(Odds Ratio: JMML vs. Control)\n↑ More detected in JMML, ↓ More detected in Control",
y = "-log10(P-value)",
color = "Relative Detection",
shape = "Statistically Significant\n(P-value < 0.05)"
)+
theme_bw(base_size = 14)
print(fish_p_RPNPF)
ggsave(filename = "../res/annotated/fisher_volcano_RPNPF_pval.png", plot = fish_p_RPNPF)
# ==== RPNPF - TTest Visualizations FDR ==========================================
ttest_RPNPF <- final_results %>%
filter(test_type == "TTest") %>%
filter(platform == "RPNPF")
# ==== Volcano Plot of Sex Differences ====
ttest_results_Combined_RPNPF_p <- ttest_RPNPF %>%
mutate(
direction = case_when(
log2_fc > 0 ~ "Higher in JMML",
log2_fc < 0 ~ "Higher in Control",
TRUE ~ "Ambiguous"
),
sig = ifelse(raw_p < 0.05, "Significant", "Not Significant")
) %>% mutate(sig = factor(sig, levels = c("Not Significant", "Significant")))
RPNPF_ttest_volcano_pval <- ggplot(
ttest_results_Combined_RPNPF_p ,
aes(x = log2_fc, y = -log10(raw_p))
) +
geom_point(aes(color = direction, shape = sig), alpha = 0.7, size = 1.5) +
geom_hline(yintercept = -log10(0.05), linetype = "dashed") +
geom_vline(xintercept = c(-1, 1), linetype = "dotted", color = "gray50") +
scale_color_manual(values = c(
"Higher in JMML" = "#E74C3C",
"Higher in Control" = "#3498DB",
"Ambiguous" = "gray50"
)) +
scale_shape_manual(values = c("Not Significant" = 16, "Significant" = 17)) + # ● = not sig, ▲ = sig
labs(
title = "JMML-based Metabolite Differences",
subtitle = "T-test on Imputed Box-Cox Transformed Intensities\n(RPNPF Platform)",
caption = paste0("n = ", nrow(ttest_results_Combined_RPNPF_p), " metabolites tested via two-sided t-test."),
x = "log2FC(JMML / Control)\n↑ Higher in JMML, ↓ Higher in Control",
y = "-log10(p-value)\nfrom t-test",
color = "Relative Intensity",
shape = "Statistically Significant\n(p-value < 0.05)"
) +
theme_bw(base_size = 14)
print(RPNPF_ttest_volcano_pval)
# Save volcano plot
ggsave("../res/annotated/RPNPF_ttest_volcano_pval.png",
RPNPF_ttest_volcano_pval, width = 10, height = 6, dpi = 300)
# ==== Significant Hits ========================================================
# ==== Viz ttest sig hit FDR 2-hydro ===========================================
t_res <- final_results %>% filter(test_type == "TTest") %>% filter(global_fdr < .05)
ZHP_transformed <- readRDS("../data/R_objects/ZHP_imputed.rds")
ttest_sig <- ggplot(ZHP_transformed %>% filter(compound_name == "2-hydroxychrysene"),
aes(x = sample_group, y = intensity_log2, fill = sample_group)) +
geom_violin(trim = FALSE, scale = "width") +
geom_jitter(width = 0.15, size = 1.5, alpha = 0.6, shape = 21, stroke = 0.2) +
stat_summary(fun = mean, geom = "point", shape = 21, size = 2, fill = "white") +
scale_fill_manual(values = c("Control" = "#3498DB", "JMML" = "#E74C3C")) +
labs(
title = "",
x = "",
y = "log2(Intensity)",
fill = ""
) +
theme_minimal(base_size = 14) +
theme(
legend.position = "bottom",
axis.text.x = element_text(size = 12),
axis.text.y = element_text(size = 12),
plot.title = element_text(size = 16, face = "bold")
)
print(ttest_sig)
#Distribution of log2(intensity) of 2-hydroxychrysene
ggsave(filename = "../res/annotated/ttest_violin_2-hydroxychrysene.png", plot = ttest_sig)
# ==== Viz violin sig hit Fisher ===========================================
fish_res_top <- final_results %>%
filter(test_type == "Fisher") %>%
filter(global_fdr < .05) %>%
pull(compound_name)
# Load the non-imputed intensities (M2)
ZHP_filtered_fish <- readRDS("../data/R_objects/ZHP_annotated_filtered.rds") %>%
filter(sample_group %in% c("JMML", "Control")) %>%
filter(compound_name %in% fish_res_top) %>%
mutate(intensity_log2 = log2(intensity + 1))
writexl::write_xlsx(ZHP_filtered_fish, path = "../res/annotated/Fisher_FDR_hits.xlsx")
fish_res_violin_ZHP <- ggplot(
ZHP_filtered_fish,
aes(x = sample_group, y = intensity_log2, fill = sample_group)
) +
geom_violin(trim = FALSE, alpha = 0.7) +
geom_jitter(width = 0.2, size = 1, alpha = 0.5) +
facet_wrap(~ compound_name, scales = "free_y") +
scale_fill_manual(values = c("JMML" = "#E74C3C", "Control" = "#3498DB")) +
labs(
title = "Top 4 JMML-associated Metabolites ",
subtitle = "Fisher's Exact Test on Missingness (ZHP Platform)",
x = "Case Status",
y = "log2(Intensity)",
fill = ""
) +
theme_bw(base_size = 14) +
theme(
strip.text = element_text(size = 10, face = "bold"),
legend.position = "right"
)
fish_res_violin_ZHP
# Save violin plot
ggsave(
filename = "../res/annotated/ZHP_top4violin_fish_res.png",
plot = fish_res_violin_ZHP,
width = 10, height = 6, units = "in", dpi = 300
)
# ==== Lilly Poster ==========================================
set.seed(42)
# build the plot
fish_p_ZHP_fdr_pub <- ggplot(fish_plot_ZHP_fdr, aes(x = log2_OR, y = neglog10_fdr)) +
geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "gray80") +
geom_vline(xintercept = c(-1, 1), linetype = "dotted", color = "gray80") +
geom_point(aes(color = direction, shape = sig), alpha = 0.7) +
geom_text_repel(
data = filter(fish_plot_ZHP_fdr, sig == "Significant"),
aes(label = compound_name),
size = 3,
box.padding = 0.6,
point.padding = 0.4,
force = 2,
max.overlaps = Inf
) +
scale_color_manual(values = c(
"Detected more in JMML" = "#E74C3C",
"Detected more in Control" = "#3498DB",
"Ambiguous" = "gray50"
)) +
scale_shape_manual(values = c("Not Significant" = 16, "Significant" = 17)) +
labs(
x = expression(Log[2](OR)~"(JMML/Control)"),
y = expression(-Log[10](FDR)),
color = "Relative Detection",
shape = "Statistically Significant\n(FDR < 0.05)"
) +
theme_classic(base_size = 14, base_family = "Times") +
theme(
axis.title.x = element_text(face = "bold",
size = 14,
margin = ggplot2::margin(t = 8)),
axis.title.y = element_text(face = "bold",
size = 14,
margin = ggplot2::margin(r = 8)),
axis.text = element_text(size = 12),
legend.key.size = unit(0.8, "lines"),
legend.position = "bottom",
legend.box = "vertical",
legend.title = element_text(face = "bold"),
legend.text = element_text(size = 11),
legend.margin = margin(t = 5)
) +
guides(
color = guide_legend(title.position = "top", nrow = 1),
shape = guide_legend(title.position = "top", nrow = 1)
)
#print(fish_p_ZHP_fdr_pub)
ggsave(filename = "../res/annotated/fisher_volcano_ZHP_FDR_pub.png", plot = fish_p_ZHP_fdr_pub)After plugging in the above compounds into MetaboAnalyst (https://dev.metaboanalyst.ca/MetaboAnalyst/upload/PathUploadView.xhtml):
We are going to repeat the same workflow above for a sex comparison.
We stratified our analysis into three groups to understand how sex influences metabolite profiles:
Combined (JMML + Control): Looks at sex differences across all samples to identify overall trends.
Control-only: Focuses on healthy individuals to capture baseline sex effects in the absence of JMML.
JMML-only: Examines whether sex differences are specific to JMML.
By comparing across these groups, we can see whether observed sex effects are disease-specific or reflect more general biological differences.
## Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size):
## semi-transparency is not supported on this device: reported only once per page
## Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size):
## semi-transparency is not supported on this device: reported only once per page
## Warning in grid.Call.graphics(C_polygon, x$x, x$y, index): semi-transparency is
## not supported on this device: reported only once per page
## Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size):
## semi-transparency is not supported on this device: reported only once per page
## Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size):
## semi-transparency is not supported on this device: reported only once per page
## Warning in grid.Call.graphics(C_polygon, x$x, x$y, index): semi-transparency is
## not supported on this device: reported only once per page
## Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size):
## semi-transparency is not supported on this device: reported only once per page
## Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size):
## semi-transparency is not supported on this device: reported only once per page
## Warning in grid.Call.graphics(C_polygon, x$x, x$y, index): semi-transparency is
## not supported on this device: reported only once per page
## Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size):
## semi-transparency is not supported on this device: reported only once per page
## Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size):
## semi-transparency is not supported on this device: reported only once per page
## Warning in grid.Call.graphics(C_polygon, x$x, x$y, index): semi-transparency is
## not supported on this device: reported only once per page
## Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size):
## semi-transparency is not supported on this device: reported only once per page
## Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size):
## semi-transparency is not supported on this device: reported only once per page
## Warning in grid.Call.graphics(C_polygon, x$x, x$y, index): semi-transparency is
## not supported on this device: reported only once per page
## Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size):
## semi-transparency is not supported on this device: reported only once per page
## Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size):
## semi-transparency is not supported on this device: reported only once per page
## Warning in grid.Call.graphics(C_polygon, x$x, x$y, index): semi-transparency is
## not supported on this device: reported only once per page
## Warning in grid.Call.graphics(C_polygon, x$x, x$y,
## list(as.integer(seq_along(x$x)))): semi-transparency is not supported on this
## device: reported only once per page
## $Sex_Combined
## [1] "N-Acetyl-S-(2-hydroxy-3-butenyl)-L-cysteine ↑ Females"
## [2] "N-Acetylcysteine ↑ Females"
## [3] "1-Methylxanthine/3-Methylxanthine/7-Methylxanthine ↓ Females"
## [4] "Carnitine ↓ Females"
## [5] "Guanidinoacetic acid ↓ Females"
## [6] "Methylthioadenosine ↓ Females"
## [7] "N-Acetylalanine ↓ Females"
## [8] "Octanoylcarnitine ↓ Females"
## [9] "Pantothenic acid ↓ Females"
## [10] "2-Ethylhydracrylic acid ↑ Females"
## [11] "Propylparaben ↓ Females"
## [12] "Glycolithocholic acid/ Lithocholic acid glycine conjugate ↑ Females"
## [13] "Mannitol/Sorbitol/ Galactitol ↓ Females"
## [14] "Monomethylglutarate ↑ Females"
##
## $Sex_JMML
## [1] "N-Acetylcysteine ↑ Females"
## [2] "Carnitine ↓ Females"
## [3] "10-Hydroxydecanoic acid ↓ Females"
## [4] "2-Ethylhydracrylic acid ↑ Females"
## [5] "Propylparaben ↓ Females"
## [6] "Citrate/Isocitrate ↓ Females"
## [7] "Glycolithocholic acid/ Lithocholic acid glycine conjugate ↑ Females"
## [8] "POV-PC ↑ Females"
##
## $Sex_Control
## [1] "1-Methylxanthine/3-Methylxanthine/7-Methylxanthine ↓ Females"
## [2] "N-Acetylalanine ↓ Females"
## [3] "Tryptophan ↑ Females"
## [4] "Sphinganine-1-phosphate (d18:0) ↑ Females"
Consistency of sex-associated hits between JMML-only and Control-only.
## Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size):
## semi-transparency is not supported on this device: reported only once per page
## Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size):
## semi-transparency is not supported on this device: reported only once per page
## Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size):
## semi-transparency is not supported on this device: reported only once per page
## Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size):
## semi-transparency is not supported on this device: reported only once per page
We’re applying essentially the same pipeline as above - only now to compare JMML samples by their somatic‐mutation status rather than by sex. In brief:
Cohort: We restrict to JMML samples and split them by Somatic_Neg versus Somatic_Pos (i.e. absence or presence of a DBS‐detected somatic mutation)
Note - we set aside the two JMML samples labeled with ‘unknown’ somatic status.
## Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size):
## semi-transparency is not supported on this device: reported only once per page
## Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size):
## semi-transparency is not supported on this device: reported only once per page
## Warning in grid.Call.graphics(C_polygon, x$x, x$y, index): semi-transparency is
## not supported on this device: reported only once per page
## Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size):
## semi-transparency is not supported on this device: reported only once per page
## Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size):
## semi-transparency is not supported on this device: reported only once per page
## Warning in grid.Call.graphics(C_polygon, x$x, x$y, index): semi-transparency is
## not supported on this device: reported only once per page
## Warning in grid.Call.graphics(C_polygon, x$x, x$y,
## list(as.integer(seq_along(x$x)))): semi-transparency is not supported on this
## device: reported only once per page
## $Sex_Combined
## [1] "N-Acetyl-S-(2-hydroxy-3-butenyl)-L-cysteine ↑ Females"
## [2] "N-Acetylcysteine ↑ Females"
## [3] "1-Methylxanthine/3-Methylxanthine/7-Methylxanthine ↓ Females"
## [4] "Carnitine ↓ Females"
## [5] "Guanidinoacetic acid ↓ Females"
## [6] "Methylthioadenosine ↓ Females"
## [7] "N-Acetylalanine ↓ Females"
## [8] "Octanoylcarnitine ↓ Females"
## [9] "Pantothenic acid ↓ Females"
## [10] "2-Ethylhydracrylic acid ↑ Females"
## [11] "Propylparaben ↓ Females"
## [12] "Glycolithocholic acid/ Lithocholic acid glycine conjugate ↑ Females"
## [13] "Mannitol/Sorbitol/ Galactitol ↓ Females"
## [14] "Monomethylglutarate ↑ Females"
##
## $Somatic
## [1] "Phenol ↑ Somatic+"
## [2] "16-Hydroxy hexadecanoic acid ↑ Somatic+"
## [3] "2-Aminoisobutyric acid ↑ Somatic+"
## [4] "5-Hydroxylysine ↑ Somatic+"
## [5] "Caffeine ↑ Somatic+"
## [6] "Kynurenine ↑ Somatic+"
## [7] "N-Acety-L-Leucine ↑ Somatic+"
## [8] "N-Alpha-Acetyllysine ↑ Somatic+"
## [9] "2-Hydroxymyristic acid ↓ Somatic+"
## [10] "12-Methyltetradecanoic acid/ 13-Methylmyristic acid/ Pentadecanoic acid ↓ Somatic+"
## [11] "3-Hydroxyhexadecanoic acid ↓ Somatic+"
## [12] "4-Hydroxybenzaldehyde ↓ Somatic+"
## [13] "9,10-EpOME/ 12,13-EpOME ↑ Somatic+"
## [14] "Dehydroepiandrosterone sulfate (DHEAS) ↓ Somatic+"
## [15] "Gamma-Linolenate/ 11-Octadecen-9-ynoic acid ↑ Somatic+"
##
## $JMML_vs_Control
## [1] "2-hydroxychrysene ↑ JMML"
## [2] "3-Oxindole-3-acetate ↓ JMML"
## [3] "Arachidonoyl PAF C-16 ↑ JMML"
## [4] "Biliverdin ↑ JMML"
## [5] "Decanoylcarnitine ↓ JMML"
## [6] "Kynurenine ↑ JMML"
## [7] "Lauroylcarnitine ↓ JMML"
## [8] "N-Acetylalanine ↓ JMML"
## [9] "Xanthine ↓ JMML"
## [10] "(R)-3-Hydroxydecanoic acid ↓ JMML"
## [11] "13SHpODE / 9SHpODE ↓ JMML"
## [12] "18:2 (Cis) PC (1,2-dilinoleoyl-sn-glycero-3-phosphocholine_DLPC) ↓ JMML"
## [13] "2-Hydroxybutyrate/ 3-Hydroxybutanoate/ Alpha-Hydroxyisobutyrate ↓ JMML"
## [14] "2-Hydroxystearate ↓ JMML"
## [15] "3-Dehydroshikimate ↓ JMML"
## [16] "3-Hydroxyoctanoic acid ↓ JMML"
## [17] "5-Dodecenoic acid ↓ JMML"
## [18] "9-Decenoic acid ↓ JMML"
## [19] "Docosapentaenoic acid ↓ JMML"
## [20] "Elaidate/ Oleate/ Petroselinate/ 11-Octadecenoic acid ↓ JMML"
## [21] "Glycolithocholic acid/ Lithocholic acid glycine conjugate ↓ JMML"
## [22] "Hexadecanedioic acid ↓ JMML"
## [23] "POV-PC ↓ JMML"
## [24] "Palmitoylcarnitine ↓ JMML"
## [25] "Ribitol/Xylitol/ Arabitol ↓ JMML"
## [26] "2-Oxobutanoic acid ↑ JMML"
## [27] "3,7-Dimethyluric acid ↓ JMML"
## [28] "3-(4-Hydroxyphenyl)Pyruvic acid ↑ JMML"
## [29] "Bisphenol AP ↑ JMML"
## [30] "Caffeic acid ↑ JMML"
## [31] "Diaminopimelic acid ↓ JMML"
## [32] "Dodec-2-enedioic acid ↓ JMML"
## [33] "Sebacic acid ↓ JMML"
## [1] "/N/slate/stevbroo/project_metabolite_JMML/analysis"
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 4.4.1 (2024-06-14)
## os Red Hat Enterprise Linux 8.10 (Ootpa)
## system x86_64, linux-gnu
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz America/New_York
## date 2025-10-06
## pandoc 3.4 @ /geode2/soft/hps/rhel8/rstudio/2025.04/resources/app/bin/quarto/bin/tools/x86_64/ (via rmarkdown)
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date (UTC) lib source
## backports 1.5.0 2024-05-23 [2] CRAN (R 4.4.1)
## Biobase * 2.64.0 2024-04-30 [2] Bioconductor 3.19 (R 4.4.1)
## BiocGenerics * 0.50.0 2024-04-30 [2] Bioconductor 3.19 (R 4.4.1)
## bit 4.0.5 2022-11-15 [2] CRAN (R 4.4.1)
## bit64 4.0.5 2020-08-30 [2] CRAN (R 4.4.1)
## broom * 1.0.6 2024-05-17 [2] CRAN (R 4.4.1)
## bslib 0.7.0 2024-03-29 [2] CRAN (R 4.4.1)
## cachem 1.1.0 2024-05-16 [2] CRAN (R 4.4.1)
## cellranger 1.1.0 2016-07-27 [2] CRAN (R 4.4.1)
## cli 3.6.5 2025-04-23 [1] CRAN (R 4.4.1)
## colorspace 2.1-0 2023-01-23 [2] CRAN (R 4.4.1)
## crayon 1.5.3 2024-06-20 [2] CRAN (R 4.4.1)
## crosstalk 1.2.1 2023-11-23 [2] CRAN (R 4.4.1)
## devtools * 2.4.5 2022-10-11 [2] CRAN (R 4.4.1)
## digest 0.6.36 2024-06-23 [2] CRAN (R 4.4.1)
## dplyr * 1.1.4 2023-11-17 [2] CRAN (R 4.4.1)
## DT * 0.34.0 2025-09-02 [1] CRAN (R 4.4.1)
## ellipsis 0.3.2 2021-04-29 [2] CRAN (R 4.4.1)
## evaluate 0.24.0 2024-06-10 [2] CRAN (R 4.4.1)
## farver 2.1.2 2024-05-13 [2] CRAN (R 4.4.1)
## fastmap 1.2.0 2024-05-15 [2] CRAN (R 4.4.1)
## forcats * 1.0.0 2023-01-29 [2] CRAN (R 4.4.1)
## fs 1.6.4 2024-04-25 [2] CRAN (R 4.4.1)
## generics 0.1.4 2025-05-09 [1] CRAN (R 4.4.1)
## ggplot2 * 3.5.2 2025-04-09 [2] CRAN (R 4.4.1)
## ggrepel * 0.9.5 2024-01-10 [2] CRAN (R 4.4.1)
## ggVennDiagram * 1.5.4 2025-06-21 [1] CRAN (R 4.4.1)
## glue * 1.8.0 2024-09-30 [1] CRAN (R 4.4.1)
## gmm * 1.9-1 2025-08-26 [1] CRAN (R 4.4.1)
## gtable 0.3.5 2024-04-22 [2] CRAN (R 4.4.1)
## highr 0.11 2024-05-26 [2] CRAN (R 4.4.1)
## hms 1.1.3 2023-03-21 [2] CRAN (R 4.4.1)
## htmltools * 0.5.8.1 2024-04-04 [2] CRAN (R 4.4.1)
## htmlwidgets 1.6.4 2023-12-06 [2] CRAN (R 4.4.1)
## httpuv 1.6.15 2024-03-26 [2] CRAN (R 4.4.1)
## impute * 1.78.0 2024-04-30 [2] Bioconductor 3.19 (R 4.4.1)
## imputeLCMD * 2.1 2022-06-10 [1] CRAN (R 4.4.1)
## janitor * 2.2.0 2023-02-02 [2] CRAN (R 4.4.1)
## jquerylib 0.1.4 2021-04-26 [2] CRAN (R 4.4.1)
## jsonlite 1.8.8 2023-12-04 [2] CRAN (R 4.4.1)
## kableExtra * 1.4.0 2024-01-24 [1] CRAN (R 4.4.1)
## knitr * 1.47 2024-05-29 [2] CRAN (R 4.4.1)
## labeling 0.4.3 2023-08-29 [2] CRAN (R 4.4.1)
## later 1.3.2 2023-12-06 [2] CRAN (R 4.4.1)
## lattice 0.22-6 2024-03-20 [2] CRAN (R 4.4.1)
## lifecycle 1.0.4 2023-11-07 [2] CRAN (R 4.4.1)
## lubridate * 1.9.4 2024-12-08 [2] CRAN (R 4.4.1)
## magrittr 2.0.4 2025-09-12 [1] CRAN (R 4.4.1)
## MASS 7.3-61 2024-06-13 [2] CRAN (R 4.4.1)
## Matrix * 1.7-0 2024-04-26 [2] CRAN (R 4.4.1)
## memoise 2.0.1 2021-11-26 [2] CRAN (R 4.4.1)
## mime 0.12 2021-09-28 [2] CRAN (R 4.4.1)
## miniUI 0.1.1.1 2018-05-18 [2] CRAN (R 4.4.1)
## munsell 0.5.1 2024-04-01 [2] CRAN (R 4.4.1)
## mvtnorm * 1.2-5 2024-05-21 [2] CRAN (R 4.4.1)
## naniar * 1.1.0 2024-03-05 [1] CRAN (R 4.4.1)
## norm * 1.0-11.1 2023-06-18 [2] CRAN (R 4.4.1)
## pcaMethods * 1.96.0 2024-04-30 [1] Bioconductor 3.19 (R 4.4.1)
## pillar 1.11.1 2025-09-17 [1] CRAN (R 4.4.1)
## pkgbuild 1.4.4 2024-03-17 [2] CRAN (R 4.4.1)
## pkgconfig 2.0.3 2019-09-22 [2] CRAN (R 4.4.1)
## pkgload 1.4.0 2024-06-28 [2] CRAN (R 4.4.1)
## png 0.1-8 2022-11-29 [2] CRAN (R 4.4.1)
## profvis 0.4.0 2024-09-20 [2] CRAN (R 4.4.1)
## promises 1.3.0 2024-04-05 [2] CRAN (R 4.4.1)
## purrr * 1.1.0 2025-07-10 [1] CRAN (R 4.4.1)
## R6 2.6.1 2025-02-15 [1] CRAN (R 4.4.1)
## ragg 1.3.2 2024-05-15 [2] CRAN (R 4.4.1)
## Rcpp 1.1.0 2025-07-02 [1] CRAN (R 4.4.1)
## readr * 2.1.5 2024-01-10 [2] CRAN (R 4.4.1)
## readxl * 1.4.3 2023-07-06 [2] CRAN (R 4.4.1)
## remotes 2.5.0 2024-03-17 [2] CRAN (R 4.4.1)
## rlang 1.1.6 2025-04-11 [1] CRAN (R 4.4.1)
## rmarkdown 2.27 2024-05-17 [2] CRAN (R 4.4.1)
## rstudioapi 0.16.0 2024-03-24 [2] CRAN (R 4.4.1)
## sandwich * 3.1-0 2023-12-11 [2] CRAN (R 4.4.1)
## sass 0.4.9 2024-03-15 [2] CRAN (R 4.4.1)
## scales 1.3.0 2023-11-28 [2] CRAN (R 4.4.1)
## sessioninfo 1.2.2 2021-12-06 [2] CRAN (R 4.4.1)
## shiny 1.8.1.1 2024-04-02 [2] CRAN (R 4.4.1)
## snakecase 0.11.1 2023-08-27 [2] CRAN (R 4.4.1)
## stringi 1.8.4 2024-05-06 [2] CRAN (R 4.4.1)
## stringr * 1.5.2 2025-09-08 [1] CRAN (R 4.4.1)
## svglite 2.1.3 2023-12-08 [2] CRAN (R 4.4.1)
## systemfonts 1.1.0 2024-05-15 [2] CRAN (R 4.4.1)
## textshaping 0.4.0 2024-05-24 [2] CRAN (R 4.4.1)
## tibble * 3.3.0 2025-06-08 [1] CRAN (R 4.4.1)
## tidyr * 1.3.1 2024-01-24 [2] CRAN (R 4.4.1)
## tidyselect 1.2.1 2024-03-11 [2] CRAN (R 4.4.1)
## tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.4.1)
## timechange 0.3.0 2024-01-18 [2] CRAN (R 4.4.1)
## tmvtnorm * 1.7 2025-09-01 [1] CRAN (R 4.4.1)
## tzdb 0.4.0 2023-05-12 [2] CRAN (R 4.4.1)
## urlchecker 1.0.1 2021-11-30 [2] CRAN (R 4.4.1)
## usethis * 3.0.0 2024-07-29 [2] CRAN (R 4.4.1)
## utf8 1.2.6 2025-06-08 [1] CRAN (R 4.4.1)
## vctrs 0.6.5 2023-12-01 [2] CRAN (R 4.4.1)
## viridisLite 0.4.2 2023-05-02 [2] CRAN (R 4.4.1)
## visdat 0.6.0 2023-02-02 [1] CRAN (R 4.4.1)
## vroom 1.6.5 2023-12-05 [2] CRAN (R 4.4.1)
## withr 3.0.2 2024-10-28 [1] CRAN (R 4.4.1)
## writexl * 1.5.4 2025-04-15 [1] CRAN (R 4.4.1)
## xfun 0.45 2024-06-16 [2] CRAN (R 4.4.1)
## xml2 1.3.6 2023-12-04 [2] CRAN (R 4.4.1)
## xtable 1.8-4 2019-04-21 [2] CRAN (R 4.4.1)
## yaml 2.3.8 2023-12-11 [2] CRAN (R 4.4.1)
## zoo 1.8-12 2023-04-13 [2] CRAN (R 4.4.1)
##
## [1] /geode2/home/u090/stevbroo/Quartz/R/x86_64-pc-linux-gnu-library/4.4
## [2] /geode2/soft/hps/rhel8/r/gnu/4.4.1/lib64/R/library
##
## ──────────────────────────────────────────────────────────────────────────────